Analysis of Stack Overflow post and user data for trend analysis, and prediction of time-to-answer (classification) using Weka. CSCI599 final project on social media data analytics.
The document discusses stack overflow attacks and protection against them. It begins with an introduction to stack overflows, explaining that they occur when more items are pushed onto the stack than the allocated space allows, which can overwrite adjacent memory and lead to arbitrary code execution. Next, it describes the operation of the stack and how functions use it to store data and return addresses. The document then explains how attackers can exploit buffer overflows to manipulate the return address and inject shellcode that executes malicious code. Finally, it discusses some methods to protect against stack overflow attacks, such as bounds checking, address space layout randomization, and non-executable stacks.
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 (mumrah)
Apache Kafka is a distributed publish-subscribe messaging system that allows both publishing and subscribing to streams of records. It uses a distributed commit log that provides low latency and high throughput for handling real-time data feeds. Key features include persistence, replication, partitioning, and clustering.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Multithreading allows programs to have multiple threads that can run concurrently. Each thread defines a separate path of execution. Processes are programs that are executing, while threads exist within a process and share its resources. Creating a new thread requires fewer resources than creating a new process. There are two main ways to define a thread - by implementing the Runnable interface or by extending the Thread class.
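The two ways of defining a thread mentioned above can be shown in a minimal, self-contained Java sketch (class names here are illustrative only):

```java
public class ThreadDemo {
    static final StringBuilder log = new StringBuilder();

    // Option 1: implement Runnable (preferred; leaves the class free to extend another)
    static class HelloRunnable implements Runnable {
        public void run() { synchronized (log) { log.append("runnable;"); } }
    }

    // Option 2: extend Thread directly
    static class HelloThread extends Thread {
        public void run() { synchronized (log) { log.append("thread;"); } }
    }

    public static String runBoth() {
        Thread t1 = new Thread(new HelloRunnable());
        Thread t2 = new HelloThread();
        t1.start();
        t2.start();
        try {
            t1.join();  // wait for both paths of execution to finish
            t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(runBoth());
    }
}
```

Note that the two started threads run concurrently, so the order of the two log entries is not guaranteed.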
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Thrift vs Protocol Buffers vs Avro - Biased Comparison (Igor Anishchenko)
Igor Anishchenko
Odessa Java TechTalks
Lohika - May, 2012
Let's take a step back and compare data serialization formats, of which there are plenty. What are the key differences between Apache Thrift, Google Protocol Buffers and Apache Avro? Which is "the best"? The truth of the matter is that they are all very good, and each has its own strong points. Hence, the answer is as much a matter of personal choice as of understanding the historical context of each and correctly identifying your own individual requirements.
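For a flavor of the difference, here is roughly how one hypothetical record (the name and fields are invented for illustration) looks in each format's schema language:

```thrift
struct User {
  1: required i64 id,
  2: optional string name
}
```

```protobuf
message User {
  required int64 id = 1;
  optional string name = 2;
}
```

```json
{"type": "record", "name": "User",
 "fields": [{"name": "id",   "type": "long"},
            {"name": "name", "type": ["null", "string"]}]}
```

One practical consequence: Thrift and Protocol Buffers compile their IDLs into generated classes, while Avro can read its JSON schema at runtime without code generation, and that difference alone often drives the choice.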
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov... (Databricks)
So you know you want to write a streaming app, but any non-trivial streaming app developer would have to think about these questions:
– How do I manage offsets?
– How do I manage state?
– How do I make my Spark Streaming job resilient to failures? Can I avoid some failures?
– How do I gracefully shutdown my streaming job?
– How do I monitor and manage my streaming job (e.g., retry logic)?
– How can I better manage the DAG in my streaming job?
– When do I use checkpointing, and for what? When should I not use checkpointing?
– Do I need a WAL when using a streaming data source? Why? When don’t I need one?
This session will share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
Presented at #H2OWorld 2017 in Mountain View, CA.
Enjoy the video: https://youtu.be/42Oo8TOl85I.
Learn more about H2O.ai: https://www.h2o.ai/.
Follow @h2oai: https://twitter.com/h2oai.
- - -
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O has made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep neural networks, in particular, are notoriously difficult for a non-expert to tune properly. In this presentation, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top-performing model on the AutoML Leaderboard. H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin's Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Spark (Structured) Streaming vs. Kafka Streams (Guido Schmutz)
Independent of the source of data, the integration and analysis of event streams is becoming more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular streaming analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing and has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution which is part of Kafka. It is provided as a Java library and by that can be easily integrated with any Java application.
This presentation shows how you can implement stream processing solutions with each of the two frameworks, discusses how they compare and highlights the differences and similarities.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303... (Amazon Web Services)
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
The document discusses lambda expressions in Java 8. It defines lambda expressions as anonymous functions that can be passed around as method parameters or returned from methods. Lambda expressions allow treating functions as first-class citizens in Java by letting functions be passed around as arguments to other functions. The document provides examples of lambda expressions in Java 8 and how they can be used with functional interfaces, method references, the forEach() method, and streams. It also discusses scope and type of lambda expressions and provides code samples demonstrating streams and stream pipelines.
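The ideas above can be condensed into a short, hypothetical example showing a lambda, a method reference, a stream pipeline, and a functional interface passed as a parameter (names are invented for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntBinaryOperator;
import java.util.stream.Collectors;

public class LambdaDemo {
    // A stream pipeline combining a lambda and a method reference
    public static List<String> upperCased(List<String> words) {
        return words.stream()
                    .filter(w -> w.length() > 3)   // lambda as a Predicate
                    .map(String::toUpperCase)      // method reference
                    .collect(Collectors.toList());
    }

    // A functional interface accepted as a parameter: behavior passed as data
    public static int apply(IntBinaryOperator op, int a, int b) {
        return op.applyAsInt(a, b);
    }

    public static void main(String[] args) {
        System.out.println(upperCased(Arrays.asList("to", "java", "lambda")));
        System.out.println(apply((x, y) -> x + y, 2, 3)); // prints 5
    }
}
```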
This document provides an overview of Spark Streaming and Structured Streaming. It discusses what Spark Streaming is, its framework, and its drawbacks. It then introduces Structured Streaming, which models streams as infinite datasets, and describes its output modes and advantages such as handling late data and event time. It covers window operations, watermarking for late data, and different types of stream-stream joins, such as inner and outer joins; watermarks and time constraints are needed for joins to handle state and provide correct results.
The document discusses multithreading and threading concepts in Java. It defines a thread as a single sequential flow of execution within a program. Multithreading allows executing multiple threads simultaneously by sharing the resources of a process. The key benefits of multithreading include proper utilization of resources, decreased maintenance costs, and improved performance of complex applications. Threads have various states like new, runnable, running, blocked, and dead during their lifecycle. The document also explains different threading methods like start(), run(), sleep(), yield(), join(), wait(), notify() etc and synchronization techniques in multithreading.
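As a small illustration of the synchronization techniques mentioned, here is a hedged sketch of a thread-safe counter; start() and join() walk each thread through the new, runnable, and dead lifecycle states (the class and method names are invented for this example):

```java
public class CounterDemo {
    private int count = 0;

    // synchronized provides mutual exclusion and memory visibility across threads
    public synchronized void increment() { count++; }
    public synchronized int get() { return count; }

    public static int countWith(int threads, int perThread) {
        CounterDemo c = new CounterDemo();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) c.increment();
            });
            ts[i].start();   // state transition: new -> runnable
        }
        try {
            for (Thread t : ts) t.join();  // wait until each thread is dead
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return c.get();
    }

    public static void main(String[] args) {
        System.out.println(countWith(4, 10000)); // prints 40000
    }
}
```

Without the synchronized keyword, the lost-update race on count++ would make the final total nondeterministic.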
Aljoscha Krettek is the PMC chair of Apache Flink and Apache Beam, and co-founder of data Artisans. Apache Flink is an open-source platform for distributed stream and batch data processing. It allows for stateful computations over data streams in real-time and historically. Flink supports batch and stream processing using APIs like DataSet and DataStream. Data Artisans originated Flink and provides an application platform powered by Flink and Kubernetes for building stateful stream processing applications.
Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. And while the DataStream API can handle batch use cases, it is much less efficient in that compared to the DataSet API. The Table API was built as a unified API on top of both, to cover batch and streaming with the same API, and under the hood delegate to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like
The document discusses virtual machines and JavaScript engines. It provides a brief history of virtual machines from the 1970s to today. It then explains how virtual machines work, including the key components of a parser, intermediate representation, interpreter, garbage collection, and optimization techniques. It discusses different approaches to interpretation like switch statements, direct threading, and inline threading. It also covers compiler optimizations and just-in-time compilation that further improve performance.
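The switch-statement approach to interpretation mentioned above can be sketched as a toy stack machine in Java (the opcode set and program encoding are invented for this illustration):

```java
public class SwitchVM {
    // Minimal invented bytecode set for the sketch
    static final int PUSH = 0, ADD = 1, MUL = 2, HALT = 3;

    // Classic switch-based dispatch: fetch the next opcode, branch to its handler
    public static int run(int[] code) {
        int[] stack = new int[64];
        int sp = 0, pc = 0;
        while (true) {
            switch (code[pc++]) {
                case PUSH: stack[sp++] = code[pc++]; break;
                case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
                case MUL:  stack[sp - 2] *= stack[sp - 1]; sp--; break;
                case HALT: return stack[sp - 1];
            }
        }
    }

    public static void main(String[] args) {
        // Computes (2 + 3) * 4
        int[] prog = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT};
        System.out.println(run(prog)); // prints 20
    }
}
```

Direct and inline threading replace this per-instruction switch with jumps straight to the next handler, trading portability for less dispatch overhead.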
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
A reflection-oriented program component can monitor the execution of an enclosed region of code and can modify itself according to a desired goal related to that region.
Reflection is one of those things like multi-threading where everyone with experience of it says “Don’t use it unless you absolutely have to”.
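A minimal Java illustration of that double-edged power: looking a method up by name at runtime and invoking it (the class and method here are hypothetical):

```java
import java.lang.reflect.Method;

public class ReflectDemo {
    public static String greet(String name) { return "hello " + name; }

    // Resolve and invoke a method by its name at runtime: the essence of reflection
    public static Object callByName(String methodName, String arg) {
        try {
            Method m = ReflectDemo.class.getMethod(methodName, String.class);
            return m.invoke(null, arg);   // null receiver because greet is static
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callByName("greet", "world")); // prints hello world
    }
}
```

The cost is clear even in this tiny sketch: a typo in the method name compiles fine and only fails at runtime, which is one reason for the "don't use it unless you have to" advice.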
Real Time UI with Apache Kafka Streaming Analytics of Fast Data and Server Push (Lucas Jellema)
Fast data arrives in real time and potentially high volume. Rapid processing, filtering and aggregation is required to ensure timely reaction and current information in user interfaces. Doing so is a challenge; making it happen in a scalable and reliable fashion is even more interesting. This session introduces Apache Kafka as the scalable event bus that takes care of the events as they flow in, and Kafka Streams for the streaming analytics. Both Java and Node applications are demonstrated that interact with Kafka and leverage Server-Sent Events and WebSocket channels to update the Web UI in real time. User activity performed by the audience in the Web UI is processed by the Kafka-powered back end and results in live updates on all clients. Kafka Streams and KSQL are used to analyze the events in real time and publish events with the live findings.
Apache Arrow: High Performance Columnar Data Framework (Wes McKinney)
The document provides an introduction to Gradle, an open source build automation tool. It explains that Gradle is a general purpose build system with a rich build description language based on Groovy. It supports build-by-convention and is flexible and extensible, with built-in plugins for Java, Groovy, Scala, web and OSGi. The presentation covers Gradle's basic features, principles, files and collections, dependencies, multi-project builds, plugins and reading materials.
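Build-by-convention means a tiny script gets a full build: applying the Java plugin alone wires up compile, test and jar tasks. A minimal, hypothetical build.gradle (the dependency coordinates are only examples):

```groovy
// Applying the 'java' plugin brings compile, test and jar tasks by convention.
plugins {
    id 'java'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'org.apache.commons:commons-lang3:3.12.0'
    testImplementation 'junit:junit:4.13.2'
}
```

Because the script is Groovy, anything beyond the conventions can be expressed as ordinary code in the same file.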
Agreement in a distributed system is complicated but required. Scylla gained lightweight transactions through Paxos, but the latter has a cost of 3x the roundtrips. Raft can allow consistent transactions without the performance penalty. Beyond LWT, we plan to integrate Raft with most aspects of Scylla, making a leap forward in manageability and consistency.
Top 5 Mistakes When Writing Spark Applications (Spark Summit)
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG, failing to avoid shuffles and to bring work to the data; prefer ReduceByKey over GroupByKey and TreeReduce over Reduce when possible.
5) Classpath conflicts arising from mismatched library versions, which can be addressed using shading.
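The salting technique from mistake 3 can be sketched outside Spark as a two-stage aggregation in plain Java (the salted-key format and all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class SaltingDemo {
    // Simulates spreading one hot key across several buckets, then re-combining.
    public static Map<String, Integer> saltedCount(String[] keys, int salts) {
        // Stage 1: count by salted key ("key#0", "key#1", ...), so a skewed key
        // is split across up to `salts` independent partial aggregations.
        Map<String, Integer> partial = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            String salted = keys[i] + "#" + (i % salts);
            partial.merge(salted, 1, Integer::sum);
        }
        // Stage 2: strip the salt and combine the partial counts.
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            String original = e.getKey().substring(0, e.getKey().lastIndexOf('#'));
            result.merge(original, e.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] keys = {"hot", "hot", "hot", "hot", "cold"};
        System.out.println(saltedCount(keys, 2)); // hot=4, cold=1 (order may vary)
    }
}
```

In Spark the same idea applies to a skewed join or groupBy: aggregate on the salted key first, then aggregate again on the original key.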
The document describes a Big Data project analyzing the Stack Overflow dataset. Key points:
- The dataset was obtained from Stack Exchange and consists of XML files totaling around 20GB that were parsed and loaded into HDFS.
- The data was analyzed to identify trending questions, unanswered questions, closed questions, dead questions, top tags, and more. PageRank analysis was also performed to rank posts.
- The results were visualized in a web application using technologies like Hive, HBase, Pig, and Mahout recommendation. Performance issues were encountered during the MapReduce parsing of the large XML files.
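The PageRank analysis mentioned can be sketched as a toy power iteration in Java (the graph, the damping factor 0.85, and the iteration count are illustrative, not the project's actual setup):

```java
public class PageRankDemo {
    // Power-iteration PageRank over an adjacency list with damping factor 0.85.
    public static double[] pageRank(int[][] outLinks, int iters) {
        int n = outLinks.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, 0.15 / n);   // teleport term
            for (int u = 0; u < n; u++) {
                if (outLinks[u].length == 0) continue;  // skip dangling nodes in this sketch
                for (int v : outLinks[u])
                    next[v] += 0.85 * rank[u] / outLinks[u].length;
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1, 1 -> 2, 2 -> 0: a symmetric cycle, so every rank is 1/3
        int[][] g = {{1}, {2}, {0}};
        double[] r = pageRank(g, 50);
        System.out.printf("%.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

At Stack Overflow scale this loop becomes a MapReduce or Pig job where each iteration distributes rank mass along the link graph, but the per-iteration arithmetic is the same.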
In this talk we briefly discuss some of our recent studies of Stack Overflow, a popular Q&A site targeting software developers. As opposed to studies of software artefacts discussed at Stack Overflow (e.g., APIs or programming examples), we focus on studying individuals active on Stack Overflow---who are they, what motivates them, and what affects their participation in Stack Overflow discussions.
Our findings indicate that Stack Overflow is no different from other communities of software developers in terms of gender representation but is significantly different from them in terms of gender engagement: controlling for engagement duration women and men ask and answer comparable number of questions, but women disengage faster. We conjecture that faster disengagement of women is the less pretty consequence of gamification mechanisms embedded in Stack Overflow, the same gamification mechanisms that provide developers with faster answers than ever before, attract numerous contributors and ultimately catalyse software development.
As an additional contribution we present genderComputer, a tool inferring gender of an individual based on her/his name and location.
The talk is based on the following papers:
* Gender, representation and online participation: A quantitative study, Vasilescu, B., Capiluppi, A. and Serebrenik, A., Interacting with Computers. 2013, Oxford University Press.
* How social Q&A sites are changing knowledge sharing in open source software communities, Vasilescu, B., Serebrenik, A., Devanbu, P. T. and Filkov, V., In CSCW, 2014, ACM.
* StackOverflow and GitHub: Associations between software development and crowdsourced knowledge, Vasilescu, B., Filkov, V. and Serebrenik, A., In Social Computing, 2013, IEEE.
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
Independent of the source of data, the integration and analysis of event streams gets more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular Streaming Analytics solutions: Spark Streaming and Kafka Streams.
Spark is fast and general engine for large-scale data processing and has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution which is part of Kafka. It is provided as a Java library and by that can be easily integrated with any Java application.
This presentation shows how you can implement stream processing solutions with each of the two frameworks, discusses how they compare and highlights the differences and similarities.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
The document discusses lambda expressions in Java 8. It defines lambda expressions as anonymous functions that can be passed around as method parameters or returned from methods. Lambda expressions allow treating functions as first-class citizens in Java by letting functions be passed around as arguments to other functions. The document provides examples of lambda expressions in Java 8 and how they can be used with functional interfaces, method references, the forEach() method, and streams. It also discusses scope and type of lambda expressions and provides code samples demonstrating streams and stream pipelines.
This document provides an overview of Spark Streaming and Structured Streaming. It discusses what Spark Streaming is, its framework, and drawbacks. It then introduces Structured Streaming, which models streams as infinite datasets. It describes output modes, advantages like handling late data and event times. It covers window operations, watermarking for late data, and different types of stream-stream joins like inner and outer joins. Watermarks and time constraints are needed for joins to handle state and provide correct results.
The document discusses multithreading and threading concepts in Java. It defines a thread as a single sequential flow of execution within a program. Multithreading allows executing multiple threads simultaneously by sharing the resources of a process. The key benefits of multithreading include proper utilization of resources, decreased maintenance costs, and improved performance of complex applications. Threads have various states like new, runnable, running, blocked, and dead during their lifecycle. The document also explains different threading methods like start(), run(), sleep(), yield(), join(), wait(), notify() etc and synchronization techniques in multithreading.
Aljoscha Krettek is the PMC chair of Apache Flink and Apache Beam, and co-founder of data Artisans. Apache Flink is an open-source platform for distributed stream and batch data processing. It allows for stateful computations over data streams in real-time and historically. Flink supports batch and stream processing using APIs like DataSet and DataStream. Data Artisans originated Flink and provides an application platform powered by Flink and Kubernetes for building stateful stream processing applications.
Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. And while the DataStream API can handle batch use cases, it is much less efficient in that compared to the DataSet API. The Table API was built as a unified API on top of both, to cover batch and streaming with the same API, and under the hood delegate to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like
The document discusses virtual machines and JavaScript engines. It provides a brief history of virtual machines from the 1970s to today. It then explains how virtual machines work, including the key components of a parser, intermediate representation, interpreter, garbage collection, and optimization techniques. It discusses different approaches to interpretation like switch statements, direct threading, and inline threading. It also covers compiler optimizations and just-in-time compilation that further improve performance.
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, together with generated Java code. The code is compiled to Java bytecode, executed at runtime by the JVM, and optimized by the JIT into native machine code. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
A reflection-oriented program component can monitor the execution of an enclosed region of code and can modify itself according to a desired goal related to that region.
Reflection is one of those things like multi-threading where everyone with experience of it says “Don’t use it unless you absolutely have to”.
Real Time UI with Apache Kafka: Streaming Analytics of Fast Data and Server Push (Lucas Jellema)
Fast data arrives in real time and potentially high volume. Rapid processing, filtering and aggregation are required to ensure timely reaction and current information in user interfaces. Doing so is a challenge; making it happen in a scalable and reliable fashion is even more interesting. This session introduces Apache Kafka as the scalable event bus that takes care of the events as they flow in, and Kafka Streams for the streaming analytics. Both Java and Node applications are demonstrated that interact with Kafka and leverage Server-Sent Events and WebSocket channels to update the web UI in real time. User activity performed by the audience in the web UI is processed by the Kafka-powered back end and results in live updates on all clients. Kafka Streams and KSQL are used to analyze the events in real time and publish events with the live findings.
Apache Arrow: High Performance Columnar Data Framework (Wes McKinney)
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help boost feelings of calmness, happiness and focus.
The document provides an introduction to Gradle, an open source build automation tool. It discusses that Gradle is a general purpose build system with a rich build description language based on Groovy. It supports "build-by-convention" and is flexible and extensible, with built-in plugins for Java, Groovy, Scala, web and OSGi. The presentation covers Gradle's basic features, principles, files and collections, dependencies, multi-project builds, plugins and reading materials.
Agreement in a distributed system is complicated but required. Scylla gained lightweight transactions through Paxos, but the latter costs 3X roundtrips. Raft can allow consistent transactions without the performance penalty. Beyond LWT, we plan to integrate Raft with most aspects of Scylla, making a leap forward in manageability and consistency.
Top 5 Mistakes When Writing Spark Applications (Spark Summit)
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG to avoid shuffles and bring work to the data. Using ReduceByKey over GroupByKey and TreeReduce over Reduce when possible.
5) Classpath conflicts arising from mismatched library versions, which can be addressed by shading dependencies.
The document describes a Big Data project analyzing the Stack Overflow dataset. Key points:
- The dataset was obtained from Stack Exchange and consists of XML files totaling around 20GB that were parsed and loaded into HDFS.
- The data was analyzed to identify trending questions, unanswered questions, closed questions, dead questions, top tags, and more. PageRank analysis was also performed to rank posts.
- The results were visualized in a web application using technologies like Hive, HBase, Pig, and Mahout recommendation. Performance issues were encountered during the MapReduce parsing of the large XML files.
In this talk we briefly discuss some of our recent studies of Stack Overflow, a popular Q&A site targeting software developers. As opposed to studies of software artefacts discussed at Stack Overflow (e.g., APIs or programming examples), we focus on studying individuals active on Stack Overflow---who are they, what motivates them, and what affects their participation in Stack Overflow discussions.
Our findings indicate that Stack Overflow is no different from other communities of software developers in terms of gender representation, but is significantly different from them in terms of gender engagement: controlling for engagement duration, women and men ask and answer comparable numbers of questions, but women disengage faster. We conjecture that faster disengagement of women is the less pretty consequence of the gamification mechanisms embedded in Stack Overflow, the same mechanisms that provide developers with faster answers than ever before, attract numerous contributors, and ultimately catalyse software development.
As an additional contribution we present genderComputer, a tool inferring gender of an individual based on her/his name and location.
The talk is based on the following papers:
* Gender, representation and online participation: A quantitative study, Vasilescu, B., Capiluppi, A. and Serebrenik, A., Interacting with Computers. 2013, Oxford University Press.
* How social Q&A sites are changing knowledge sharing in open source software communities, Vasilescu, B., Serebrenik, A., Devanbu, P. T. and Filkov, V., In CSCW, 2014, ACM.
* StackOverflow and GitHub: Associations between software development and crowdsourced knowledge, Vasilescu, B., Filkov, V. and Serebrenik, A., In Social Computing, 2013, IEEE.
The document discusses a project analyzing Stack Overflow data to identify the next trending programming language. It hypothesizes that Stack Overflow data can be used to determine the trending language and why it trends. The project designed SAX parsers to handle the XML-format data and assumes Java may be trending due to its widespread use. Elastic Search and Kibana will be used for analysis. Future plans include using the Stack Overflow API to collect live data through surveys and experiments.
- Study the architecture and design
- Compare Old & New Technology stack
- Analyze evolution of architecture and scalability
- Lessons learned over time
This document describes an assignment to analyze response times for questions on Q&A websites like StackOverflow. Specifically, it involves analyzing questions from a gaming StackExchange dataset and calculating features related to tags for each question, including tag popularity, number of popular tags, number of active subscribers to tags, and percentage of subscribers who are active. Response times will then be plotted against these tag features to analyze correlations. The deliverables are to calculate the features for each question programmatically and generate box plots and CDF plots of response times for each feature.
Stack Overflow - It's all about performance / Marco Cecconi (Stack Overflow) (Ontico)
Stack Overflow, and its Q&A network Stack Exchange, have been growing exponentially for the last five years. They now encompass:
~150 Q&A sites
~9 million users
~13 million questions
~22 million answers
In this talk, I will describe:
+ The physical architecture of Stack Overflow. How many servers are there? What is their purpose and what are their specs?
+ The logical architecture of the software. How do we scale up? What are the main building blocks of our software?
+ The tooling system. What supports our extreme optimization philosophy?
+ The development team. What are our core values? What footprint do we want to leave as developers?
This document discusses the use of social media in software engineering. It explores several social media channels used in software development, including Q&A websites like Stack Overflow, microblogging platforms like Twitter, and developer blogs. For each channel, it summarizes relevant research that looks at how developers use these tools, what kinds of content and discussions they facilitate, and their potential benefits and risks. The document also considers how social media could enhance tasks like collaboration and learning, but also reverse trends like reliance on formal documentation or make tools like email lists obsolete. It concludes that while software development is becoming more collaborative, the impacts of social media are still unknown and require more research.
Towards the Social Programmer (MSR 2012 Keynote by M. Storey) (Margaret-Anne Storey)
Audio+slide video is posted at http://margaretannestorey.wordpress.com.
Slides from a Keynote at Mining Software Repository Conference 2012, co-located with ICSE 2012 in Zurich, Switzerland.
The document describes an institutional repository system based on the DSpace software for storing and providing access to research produced at UNAN-Managua. The repository would increase the visibility and prestige of the university by providing public access to its research output. DSpace, an open-source package, offers features such as storing different types of digital documents, organizing them into communities and collections, and enabling version control and tracking.
This document discusses various topics related to programming efficiently in Groovy and Grails, including:
- Organizing classes into packages in Eclipse and importing dependencies
- The structure of a Grails project and where different types of code belong
- Automatically and manually generating controllers and views in Grails
- Using log4j for logging instead of println statements
- Examples of useful Grails plugins
- Tips for choosing and using Grails plugins
- Maintaining a clean coding style
This document presents technological solutions for open educational resources (OER), including standards for production, metadata, and interoperable platforms. It also describes the REDA project of Colombia's Ministry of Education, which seeks to implement a national OER system through the development of a repository platform and a national integrator.
What is Node.js used for: The 2015 Node.js Overview Report (Gabor Nagy)
This document discusses Node.js developers based on surveys from Moz.com and Stack Overflow. It provides statistics on Node.js developers, such as their locations (mostly the US, Canada, the UK, Germany and India). It also discusses the technologies Node.js developers use most, like JavaScript, PHP, AngularJS and MongoDB. Additional details are given on source control, work environments, and preferences around microservices architecture and coffee. The document points to further insights into Node.js developers.
MoodleMoot 2014 Colombia presentation - Integrating Moodle with a Repositorio... (Paola Amadeo)
Connecting Moodle with a digital repository of open learning objects. An experience at the School of Informatics of the Universidad Nacional de La Plata, Argentina.
Authors: Javier Díaz, Alejandra Schiavoni, Alejandra Osorio, Paola Amadeo, M. Emilia Charnelli, José Schultz, Alex Humar, Agustina Reynoso
Here at MRM, we are delivering new and exciting work with HTML5, CSS3, responsive web and cross-platform solutions, but what does that really entail? Hear the tech team explain recent work, current trends and future capabilities.
This document summarizes a project to visualize the relationships between questions in a network graph. It discusses using properties like node size, distance, and directed edges to represent the quality, similarity, and associations between questions. The team explored the data, performed cleansing and analysis, and used tools like R, Hive, and Sigma.js to generate the network graph and allow interactive exploration of the question relationships. The final product is deployed online for users to interact with the graph visualization.
An open-source API server based on a Node.js API framework, built on a supported Node.js platform with tooling and DevOps. Use cases include an omni-channel API server, Mobile Backend as a Service (mBaaS), and a next-generation Enterprise Service Bus. Key functionality includes built-in enterprise connectors, ORM, offline sync, mobile and JS SDKs, isomorphic JavaScript, and a graphical API creation tool.
This presentation describes the approach that I developed for Kaggle's WISE 2014 challenge. The challenge was about multi-label classification of printed media articles into topics. The main ingredients of my solution were a plug-in rule approach for F1 maximization, feature selection using a chi-squared based criterion, feature normalization, and a multi-view ensemble scheme.
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer... (Preetha Chatterjee)
The document describes an exploratory study of mining information from Slack Q&A chat logs to support software engineering tools. The researchers analyzed conversations from several Slack communities to address how prevalent useful information is for tools (RQ1) and what challenges automatic mining faces (RQ2). Results found that while Slack contains similar information as Stack Overflow, identifying accepted answers is difficult. Future work is proposed to link Slack and Stack Overflow and mine conversations for insights.
Automatic Identification of Informative Code in Stack Overflow Posts (Preetha Chatterjee)
Despite Stack Overflow’s popularity as a resource for solving coding problems, identifying relevant information from an individual post remains a challenge. The overload of information in a post can make it difficult for developers to identify specific and targeted code fixes. In this paper, we aim to help users identify informative code segments, once they have narrowed down their search to a post relevant to their task. Specifically, we explore natural language- based approaches to extract problematic and suggested code pairs from a post. The goal of the study is to investigate the potential of designing a browser extension to draw the readers’ attention to relevant code segments, and thus improve the experience of software engineers seeking help on Stack Overflow.
This document provides an overview of the ICS 314 and 613: Software Engineering course taught by Philip Johnson. It outlines the instructor's background and contact information, goals of the course, what constitutes "quality" software, open source development principles, standards and feedback, course structure, prerequisites, grading, differences between 314 and 613, lectures and labs, quizzes, engineering log requirements, developing a professional persona, collaboration vs. cheating policies, and lessons learned from past students.
18CSS101J PPS Unit 1
Evolution of Programming & Languages - Problem Solving through Programming - Creating Algorithms - Drawing Flowcharts - Writing Pseudocode - Evolution of C language, its usage history - Input and output functions: Printf and scanf - Variables and identifiers – Expressions - Single line and multiline comments - Constants, Keywords - Values, Names, Scope, Binding, Storage Classes - Numeric Data types: integer - floating point - Non-Numeric Data types: char and string - Increment and decrement operator - Comma, Arrow and Assignment operator - Bitwise and Sizeof operator
This document provides an overview of a course on operating systems. It includes sections on course organization, objectives, outline, laboratories, assessments, exams, policies, and an introduction to key topics covered in the course like OS types, multiprogramming, scheduling algorithms like FCFS, SJF, and round robin. The course aims to provide a comprehensive tour of major OS components and concepts through lectures, labs, assignments, and exams.
Information-Rich Programming in F# with Semantic Data (Steffen Staab)
Programming with rich data frequently implies that one needs to search for, understand, integrate and program with new data - with each of these steps constituting a major obstacle to successful data use.
In this talk we will explain and demonstrate how our approach, LITEQ - Language Integrated Types, Extensions and Queries for RDF Graphs, which is realized as part of the F# / Visual Studio environment, supports the software developer. Using the extended IDE the developer may now
a. explore new, previously unseen data sources, which are either natively in RDF or mapped into RDF;
b. use the exploration of schemata and data in order to construct types and objects in the F# environment;
c. automatically map between data and programming language objects in order to make them persistent in the data source;
d. have extended typing functionality added to the F# environment, resulting from the exploration of the data source and its mapping into F#.
Core to this approach is the novel node path query language, NPQL, that allows for interactive, intuitive exploration of data schemata and data proper, as well as for the mapping and definition of types, object collections and individual objects. Beyond the existing type provider mechanism for F#, our approach also allows for property-based navigation and runtime querying for data objects.
This document outlines the development of an app called ASMS C.A.N. (Calling all Nerds) which will allow students to anonymously post questions for other students and teachers to answer. By week 7, the developers aim to have a working prototype and complete frameworks. The project leader is Nathan Towell and collaborators include Maja and David McAfee. Key targets are outlined on a weekly basis through week 7, including completing the database framework by week 1 and having a basic prototype by week 3.
Go to a job posting site (CareerBuilder, Dice, ComputerJobs, etc.) or use search engines to find Java developer or Java programmer positions. Copy and paste the job posting into the Discussion area. Briefly explore all the topics that you will learn in this class this session. What are the skills you will learn in this course that are also requirements for the positions you see posted by you and your classmates?
The talk was given at a seminar of the InfinIT interest group on High-Level Languages for Embedded Systems on 2 October 2013. Read more about the interest group here: http://infinit.dk/dk/interessegrupper/hoejniveau_sprog_til_indlejrede_systemer/hoejniveau_sprog_til_indlejrede_systemer.htm
This document provides an overview of a course on fundamentals of programming. It introduces the instructor, teaching assistant, and their contact details. It outlines the course credits, grade distribution, textbook, and reference book. It describes the course contents which will cover topics like introduction to programming, variables, operators, conditional statements, loops, functions, arrays, structures, pointers, and file I/O over 18 weeks. It discusses the course outcomes and expectations, attendance, assignment, quiz, and lab policies. Ground rules for student civility are also outlined.
This document provides an overview and introduction to a course on principles of compiler design. It discusses the motivation for studying compilers, as language processing is important for many software applications. It outlines what will be covered in the course, including the theoretical foundations and practical techniques for developing lexical analyzers, parsers, type checkers, code generators, and more. The document also describes the organization of the course with lectures, programming assignments, and exams.
This course introduces students to social and ethical issues related to the information technology profession. It discusses the impact of technology on society and the relationship between professionals and ethics. The course covers topics such as theories of social change, the definition of a profession, different ethical theories, and codes of ethics for IT professionals. The goal is to provide a general understanding of working in an IT profession from social and ethical perspectives.
Natural language processing for requirements engineering: ICSE 2021 Technical... (alessio_ferrari)
These are the slides for the technical briefing given at ICSE 2021 by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
Invited Talk at Summer School on Semantic Web, Bertinoro, 2015
Abstract:
Two decades ago one discussed how to build seamless digital workflows such that the medium for data in a workflow would not switch between paper, fax, phone, and digital, because each transcription from one medium to another would be laborious and cost-inefficient. Thus, the issue was avoiding *medium discontinuities*. Today, we have all-digital data workflows, but we still have plenty of *semantic discontinuities*.
In this talk, I first want to describe reasons for these discontinuities, including: autonomy of data providers, the need for agility and flexibility, and decentralized organizations in the world-wide data spaces.
Then I want to describe several semantic discontinuities and some efforts to ameliorate them by:
1. Semantic programming (horizontal workflow paradigm)
2. Core ontologies (vertical workflow paradigm)
3. Semantic data production and consumption (sticky semantics)
A successful startup requires the best possible talent. Great people are out there, but how do you find them? And how do you make them want to work for you? This session focuses on identifying the positions necessary for your startup to scale, attracting the best talent using limited resources, and making sure you have a plan in place to find the right people for the job.
Texas.gov Presents: Battle of Programming Languages (Texas.gov)
Software technology is evolving quickly. Platforms, programming languages, and frameworks are created at a pace faster than ever before. New techniques, processes, and standards are also emerging that can impact your organization, especially if you're not prepared. There are times, though, when these same technologies fade fast. Catching up with and staying abreast of new technologies is as important as knowing which technologies will yield long-term results.
How can your organization's technology strategy keep up with all the changes? Join us for a light-hearted yet informative look at the “Battle of the Programming Languages” and our take on how to keep up with emerging technologies and techniques, and how you can align your organization's technology goals with the ever moving software industry.
Visualising the world of competitive programming with Python (Codeforces) (Anuj Menta)
This document discusses competitive programming and compares several competitive programming websites. It finds that Codeforces has the most programming questions overall and hosts weekly programming contests. Codeforces also uniquely makes all participant code public and uses a rating system similar to chess ratings. While other sites like Hackerrank and Hackerearth have additional topics, Codeforces focuses solely on programming contests and exposes participants to a wide range of languages and problem types. The document shows Codeforces has the most questions in most categories like sorting and dynamic programming. It also finds Python is mostly used for simpler implementation questions while other languages may be better for more complex categories.
5. Introduction
• Dataset used: Stack Overflow - Internet Archive.
• Why? It is one of the largest developer-focused open collaborative platforms today.
• Through our study we intend to answer some interesting questions:
• Study the rise and fall of popular programming languages
• Can be used to predict future enhancements
• Study the effectiveness of the Stack Overflow model
6. Problem Definition
• Basic analysis: What are the most popular programming languages?
• What are the trends in programming languages?
• What are the most popular topics discussed in a programming language?
• Can we accurately predict the time it takes until a questioner gets an answer?
7. Related Work
Miltiadis Allamanis and Charles Sutton. 2013. Why, When and What: Analyzing Stack Overflow Questions by Topic, Type & Code. In 10th Working Conference on Mining Software Repositories, Mining Challenge. IEEE, pages 53-56.
• Topic modeling analysis
• Used Latent Dirichlet Allocation (LDA)
• Modeled topics of Java questions
• Can evaluate the orthogonality of different languages
• Stack Overflow questions are about the code and are not application-domain specific
8. Related Work
V. Bhat, A. Gokhale, R. Jadhav, J. Pudipeddi, and L. Akoglu. Min(e)d your tags: Analysis of question response time in Stack Overflow. In Proceedings of ASONAM 2014, pages 328-335. IEEE, 2014.
• Two linear classifiers: logistic regression and SVM with a linear kernel
• Two non-linear classifiers: decision tree (DT) and SVM with a radial basis function kernel
22. Approach
The following attributes were selected for study:
1. Tag (only top 10 programming languages)
2. Creation month
3. Body length
4. Tag length
5. Introduced a new nominal class, Time_Answer, with values less6, bet6and20, and 20andmore
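The three-way discretization into the Time_Answer class can be sketched in Python. This is a minimal sketch; the function name and the exact boundary handling at 6 and 20 minutes are our assumptions, since the slides do not specify them:

```python
def time_answer_class(delta_minutes):
    # Map a question's delta-answer (minutes until its first answer)
    # to the nominal Time_Answer class used in the study.
    # Strict '<' at the boundaries is an assumption, not from the slides.
    if delta_minutes < 6:
        return "less6"
    elif delta_minutes < 20:
        return "bet6and20"
    else:
        return "20andmore"
```

For example, a question first answered after 15 minutes falls into bet6and20, while one answered after 45 minutes falls into 20andmore.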
23. Approach
Tools: Weka, a collection of machine learning algorithms for data mining tasks.
Data preprocessing: challenging!
24. Approach (Data Pre-processing)
1. Parse all the answers and link the first answer's creation time to the creation time of its question; we call this field delta-answer.
2. Remove all questions whose delta-answer is negative or zero.
3. We developed a Python script that generates the .arff file on the fly (we wish to contribute this script).
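The three preprocessing steps can be sketched in Python. This is an illustration only, assuming in-memory timestamps rather than the real XML-dump parsing, and the ARFF attribute names are our own illustrative choices:

```python
from datetime import datetime

def delta_answer_minutes(question_time, answer_times):
    # Step 1: minutes between a question's creation and its first
    # answer; None if the question was never answered.
    if not answer_times:
        return None
    return (min(answer_times) - question_time).total_seconds() / 60.0

def keep_question(delta):
    # Step 2: drop questions whose delta-answer is negative or zero.
    return delta is not None and delta > 0

def arff_lines(rows):
    # Step 3: emit Weka .arff content on the fly.
    # rows: (tag, creation_month, body_length, tag_length, class_label)
    yield "@relation stackoverflow"
    yield "@attribute tag string"
    yield "@attribute creation_month numeric"
    yield "@attribute body_length numeric"
    yield "@attribute tag_length numeric"
    yield "@attribute time_answer {less6,bet6and20,20andmore}"
    yield "@data"
    for tag, month, body_len, tag_len, cls in rows:
        yield f"'{tag}',{month},{body_len},{tag_len},{cls}"
```

Streaming the rows through a generator keeps memory flat, which matters when the input is a multi-gigabyte dump.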
25. Evaluation
• Full dataset: 4,490,947 questions; subset used: 449,000
• Classify response time into 3 classes: less than 6 minutes, between 6 and 20 minutes, and 20 minutes or more
• 10-fold cross-validation
• Results are obtained using different feature combinations and different classifiers
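Weka performs the 10-fold cross-validation internally; the protocol itself can be illustrated with a pure-Python sketch (our illustration of the procedure, not Weka's implementation):

```python
import random

def k_fold_indices(n, k=10, seed=1):
    # Shuffle the n example indices once, then deal them into k
    # roughly equal, disjoint folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(examples, train_and_eval, k=10):
    # Each fold serves as the test set exactly once; the remaining
    # k-1 folds form the training set. Returns the mean score.
    folds = k_fold_indices(len(examples), k)
    scores = []
    for i in range(k):
        test = [examples[j] for j in folds[i]]
        train = [examples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_eval(train, test))
    return sum(scores) / k
```

This loop is what Weka runs when 10-fold cross-validation is selected, repeated once per classifier and feature combination.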
28. Summary
• We successfully found interesting temporal trends for major programming languages
• Using tag-based topic analysis we identified the major discussion topics and, to some extent, the difficult topics in a programming language
• Using machine learning techniques we successfully predicted time-to-answer with good accuracy
29. Future Scope of Work
• Contribute the on-the-fly .arff generator script
• Add part-of-speech tags as an attribute
• Showcase the results on a website