2. What is Openware?
• Openware refers to software tools that are freely available to the public for use, modification, and distribution. This means that the underlying source code can be accessed, studied, modified, and shared by anyone.
• The concept of open source promotes collaboration, transparency, and community-driven development. It encourages developers from around the world to contribute improvements, fix bugs, and create new features, leading to rapid innovation and often higher-quality software.
3. • Open source software is software developed and maintained via open collaboration, and
made available, typically at no cost, for anyone to use, examine, alter and redistribute
however they like. This contrasts with proprietary or closed source software applications—
e.g. Microsoft Word, Adobe Illustrator—which are sold to end users by the creator or
copyright holder, and cannot be edited, enhanced or redistributed except as specified by the
copyright holder.
• The term open source also refers more generally to a community-based approach to
creating any intellectual property (such as software) via open collaboration, inclusiveness,
transparency, and frequent public updates.
• Open source tools can be found in various domains, including operating systems (like Linux), web browsers (such as Mozilla Firefox), office suites (like LibreOffice), programming languages (like Python), and many other applications and tools. Open-source software is governed by licenses that determine how it can be used, shared, and modified while ensuring that the software remains open and accessible to all.
4. History of open-source software
• Until the mid-1970s, computer code was seen as implicit to the
operation of the computer hardware, and not unique intellectual
property subject to copyright protection. Organizations programmed
their own software, and code sharing was a common practice.
• The Commission on New Technological Uses of Copyrighted Works (CONTU) was established in 1974 and concluded that software code was a category of creative work suitable for copyright protection. This fueled the growth of independent software publishing as an industry, with proprietary source code as the primary source of revenue.
5. • A rebellion of sorts against the restrictions and limitations of proprietary software
began in 1983. Programmer Richard Stallman chafed at the notion that users could
not customize proprietary software however they saw fit to accomplish their work.
Stallman felt that software should be “free as in free speech, not as in free beer,” and championed the notion of software that was freely available for customization.
• Stallman founded the Free Software Foundation and would go on to drive the
development of an open-source alternative to the AT&T-owned Unix operating
system, among other applications. He also innovated the first copyleft software
license, the GNU General Public License (GPL), which required anyone who
enhanced his source code to likewise publish their edited version freely to all.
• Because many felt that Stallman’s term “free software” inaptly emphasized “free of cost” as the main value of the software, the term “open source” was adopted in 1998.
6. Why do users and companies choose open source?
• Reasons for choosing open-source software can vary significantly from person to
person and organization to organization.
• In many cases, end users are completely unaware of the open-source programs on
their computers or mobile devices. It is also common for end users to download a
free application like the Mozilla Firefox browser, or an Android app. These users
simply want the software’s functionality, with no intention to rewrite or even look
at the source code.
• A company, on the other hand, might choose open-source software over a
proprietary alternative for its low (or no) cost, the flexibility to customize the
source code, or the existence of a large community supporting the application.
Professional or amateur programmers might volunteer their development and
testing skills to an open-source project, often to enhance their reputation and
connect to others in the field.
7. Data Analysis
• Data Analysis is the process of systematically applying
statistical and/or logical techniques to describe and
illustrate, condense and recap, and evaluate data.
• While data analysis in qualitative research can include
statistical procedures, many times analysis becomes an
ongoing iterative process where data is continuously
collected and analyzed almost simultaneously. Indeed,
researchers generally analyze for patterns in
observations through the entire data collection phase.
The form of the analysis is determined by the specific qualitative approach taken (field study, ethnography, content analysis, oral history, biography, unobtrusive research) and the form of the data (field notes, documents, audiotape, videotape).
9. Importance of Data Analysis
• Data analysis helps businesses understand their target market faster, increase sales, reduce costs, grow revenue, and solve problems more effectively. It plays a critical role in many aspects of modern businesses and organizations, including:
• Informed decision-making
• Identifying opportunities and challenges
• Improving efficiency and productivity
• Customer understanding and personalization
• Performance tracking and evaluation
• Predictive analytics
• Data-driven innovation
• Fraud detection and security
• Regulatory compliance
11. Data Analysis Methods
Descriptive Statistics
• Descriptive analysis involves summarizing and describing the main features of a dataset, such as mean, median, mode, standard deviation, range, and percentiles. It provides a basic understanding of the data’s distribution and characteristics.
Inferential Statistics
• Inferential statistics are used to make inferences and draw conclusions about a larger population based on a sample of data. It includes techniques like hypothesis testing, confidence intervals, and regression analysis.
Data Visualization
• Data visualization is the graphical representation of data to help analysts and stakeholders understand patterns, trends, and insights. Common visualization techniques include bar charts, line graphs, scatter plots, heat maps, and pie charts.
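The descriptive and inferential methods above can be sketched with Python's standard `statistics` module; the sample data and the normal-approximation confidence interval below are illustrative choices, not a prescribed procedure:

```python
import statistics
import math

data = [12, 15, 14, 10, 18, 20, 14, 16, 13, 17]

# Descriptive statistics: summarize the sample itself
mean = statistics.mean(data)      # central tendency
median = statistics.median(data)  # middle value
stdev = statistics.stdev(data)    # sample standard deviation

# Inferential statistics: a 95% confidence interval for the
# population mean, using the normal approximation (z = 1.96)
margin = 1.96 * stdev / math.sqrt(len(data))
ci = (mean - margin, mean + margin)

print(f"mean={mean:.2f} median={median:.1f} stdev={stdev:.2f}")
print(f"95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The descriptive numbers describe only this sample; the confidence interval is the inferential step, generalizing from the sample to the population it was drawn from.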
12. Exploratory Data Analysis (EDA)
• EDA involves analyzing and visualizing data to discover patterns,
relationships, and potential outliers. It helps in gaining insights into
the data before formal statistical testing.
Predictive Modeling
• Predictive modeling uses algorithms and statistical techniques to
build models that can make predictions about future outcomes based
on historical data. Machine learning algorithms, such as decision
trees, logistic regression, and neural networks, are commonly used
for predictive modeling.
Time Series Analysis
• Time series analysis is used to analyze data collected over time, such
as stock prices, temperature readings, or sales data. It involves
identifying trends and seasonality and forecasting future values.
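As a minimal illustration of time series analysis, the moving-average baseline below smooths a made-up sales series and forecasts the next value; real forecasting methods (ARIMA, exponential smoothing) build on the same idea:

```python
# Naive time-series forecasting with a simple moving average.
# The sales figures are made-up illustration data.
sales = [100, 102, 101, 105, 110, 108, 112, 115, 117, 120]

def moving_average(series, window):
    """Average of each sliding window of `window` points."""
    return [
        sum(series[i - window:i]) / window
        for i in range(window, len(series) + 1)
    ]

smoothed = moving_average(sales, window=3)

# Forecast the next value as the mean of the last window --
# the simplest possible baseline for trend data.
forecast = smoothed[-1]
print(f"smoothed series: {smoothed}")
print(f"next-period forecast: {forecast:.1f}")
```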
13. Factor Analysis and Principal Component Analysis (PCA)
• These techniques are used to reduce the dimensionality of data and identify underlying factors or components that explain the variance in the data.
Text Mining and Natural Language Processing (NLP)
• Text mining and NLP techniques are used to analyze and extract information from
unstructured text data, such as social media posts, customer reviews, or survey
responses.
Qualitative Data Analysis
• Qualitative data analysis involves interpreting non-numeric data, such as text,
images, audio, or video. Techniques like content analysis, thematic analysis, and
grounded theory are used to analyze qualitative data.
Quantitative Data Analysis
• Quantitative analysis focuses on analyzing numerical data to discover
relationships, trends, and patterns. This analysis often involves statistical methods.
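A toy term-frequency count, the simplest text-mining step, can be written with the standard library alone; the reviews and the stop-word list below are illustrative, not drawn from any NLP package:

```python
from collections import Counter
import re

# Tiny text-mining example: term frequency over toy "reviews".
reviews = [
    "Great product, the battery life is great",
    "Battery died fast, not great",
    "Great value and great battery",
]

STOP_WORDS = {"the", "is", "and", "not", "a"}

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

counts = Counter(
    token
    for review in reviews
    for token in tokenize(review)
    if token not in STOP_WORDS
)

print(counts.most_common(3))
```

Real NLP pipelines add stemming, n-grams, and weighting schemes such as TF-IDF on top of exactly this kind of token count.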
14. Data Mining
• Data mining involves discovering patterns, relationships, or insights from large
datasets using various algorithms and techniques.
Regression Analysis
• Regression analysis is used to model the relationship between a dependent
variable and one or more independent variables. It helps understand how changes
in one variable impact the other(s).
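For one independent variable, the regression line can be computed directly from the closed-form least-squares formulas; the data below are contrived so the fit is exact:

```python
# Ordinary least squares for one independent variable.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f}x + {intercept:.2f}")
```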
Cluster Analysis
• Cluster analysis is used to group similar data points together based on certain
features or characteristics. It helps in identifying patterns and segmenting data into
meaningful clusters.
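The assign/update loop at the heart of k-means, one common cluster-analysis algorithm, can be sketched in a few lines; the one-dimensional points and the fixed initialization below are illustrative simplifications:

```python
# Minimal k-means sketch (k = 2, one dimension) showing the
# assignment/update loop behind cluster analysis.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [points[0], points[-1]]  # deterministic initialization

for _ in range(10):  # a few iterations suffice for this data
    # Assignment step: each point joins its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to its cluster's mean
    centroids = [sum(c) / len(c) for c in clusters]

print(f"centroids: {centroids}")
print(f"clusters: {clusters}")
```

Production implementations work in many dimensions, choose initial centroids carefully (e.g. k-means++), and stop when assignments no longer change.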
15. Openware tools for Data Analysis: Business Perspective
Data analysis helps businesses:
• Gain insights from data
• Optimize processes
• Improve decision-making
• Create value for customers
• Improve data security and privacy
• Understand their customers better
• Improve sales
• Improve customer targeting
• Reduce costs
• Create better problem-solving strategies
16. Openware tools are more adaptable and flexible than proprietary software
Openware is:
• Affordable
• Transparent
• Suitable for long-term use
• Helpful for developing skills
• More secure
• Capable of high-quality results
• Able to optimize performance and reduce costs
• Customizable
• Able to improve decision-making and business results
19. 1. APACHE SPARK
• Apache Spark is a lightning-fast, open-source data-processing engine for machine learning
and AI applications, backed by the largest open-source community in big data.
• Apache Spark (Spark) is an open source data-processing engine for large data sets. It is
designed to deliver the computational speed, scalability, and programmability required for
Big Data—specifically for streaming data, graph data, machine learning, and artificial
intelligence (AI) applications.
• Spark's analytics engine can process data 10 to 100 times faster than disk-based alternatives such as MapReduce for many workloads. It scales by distributing processing work across large clusters of computers, with built-in parallelism and fault tolerance. It also provides APIs for programming languages popular among data analysts and data scientists, including Scala, Java, Python, and R.
21. Features of Apache Spark
• Fault tolerance
• Dynamic in nature
• Lazy evaluation
• Real-time stream processing
• Speed
• Reusability
• Advanced analytics
• In-memory computing
• Support for multiple languages
• Integration with Hadoop
• Cost efficiency
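Lazy evaluation, one of the features listed above, means Spark records transformations without running them until an action requests a result. Plain-Python generators (not the Spark API) give the same flavor:

```python
# Lazy evaluation in the style of Spark transformations/actions,
# illustrated with plain-Python generators, NOT the Spark API.
log = []

def numbers():
    for n in range(1, 6):
        log.append(f"read {n}")  # record when work actually happens
        yield n

# "Transformations": nothing is computed when these lines run
squared = (n * n for n in numbers())
evens = (n for n in squared if n % 2 == 0)

assert log == []                 # no data has been read so far

# "Action": iterating finally triggers the whole pipeline
result = list(evens)
print(result)                    # [4, 16]
print(len(log))                  # 5 -- each element read exactly once
```

In Spark, `map` and `filter` play the role of the generator expressions, while actions like `collect` or `count` play the role of `list()`; deferring work this way lets the engine fuse and optimize the whole pipeline before executing it.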
22. 2. KNIME
• KNIME (Konstanz Information Miner) is an open-source data-integration and analytics platform. It was developed in 2004 by software engineers at the University of Konstanz in Germany. Although first created for the pharmaceutical industry, KNIME’s strength in
accruing data from numerous sources into a single system has driven its application in
other areas. These include customer analysis, business intelligence, and machine learning.
• Its main draw (besides being free) is its usability. A drag-and-drop graphical user interface
(GUI) makes it ideal for visual programming. This means users don’t need a lot of
technical expertise to create data workflows. While it claims to support the full range of
data analytics tasks, in reality, its strength lies in data mining. Though it offers in-depth
statistical analysis too, users will benefit from some knowledge of Python and R. Being
open-source, KNIME is very flexible and customizable to an organization’s needs without heavy costs. This makes it popular with smaller businesses that have limited budgets.
23. Features of KNIME
• Scalability through sophisticated data handling
• Simple extensibility via a well-defined API for plugin extensions
• Intuitive user interface
• Import/export of workflows
• Parallel execution on multi-core systems
• Command-line version for "headless" batch executions
24. 3. RAPIDMINER
• RapidMiner uses a client/server model with the server offered either on-premises or in public or private cloud infrastructures. RapidMiner is a comprehensive data science platform with visual workflow design and full automation.
• RapidMiner provides data mining and machine learning procedures including data
loading and transformation (ETL), data preprocessing and visualization, predictive
analytics and statistical modeling, evaluation, and deployment. RapidMiner is written in
the Java programming language. RapidMiner provides a GUI to design and execute
analytical workflows. Those workflows are called “Processes” in RapidMiner, and they
consist of multiple “Operators”. Each operator performs a single task within the process,
and the output of each operator forms the input of the next one. Alternatively, the engine
can be called from other programs or used as an API. Individual functions can be called
from the command line. RapidMiner provides learning schemes, models and algorithms
and can be extended using R and Python scripts.
25. Features of RapidMiner
• Environment for data analysis and machine learning processes
• Drag-and-drop interface for designing the analysis process
• Compatibility with various data sources such as Oracle, MySQL, and SPSS files
• Uses XML to describe the operator trees modelling knowledge discovery processes
• Includes many learning algorithms from Weka
• Specialized for business solutions that include predictive analysis and statistical computing
Advantages of RapidMiner
• Offers numerous procedures, especially for attribute selection and outlier detection
• Provides full support for model evaluation using cross-validation and independent validation sets
• Integrates a large number of algorithms from related tools
• Highly flexible
26. 4. HADOOP
• Hadoop is an open-source framework for storing and processing large amounts of data. It's based
on the MapReduce programming model, which allows for the parallel processing of large
datasets. Hadoop is written in Java and is used for batch/offline processing.
• Hadoop uses distributed storage and parallel processing to handle big data and analytics jobs. It
breaks workloads down into smaller workloads that can be run at the same time. Hadoop allows
clustering multiple computers to analyze massive datasets in parallel more quickly.
Hadoop has three core components:
• Hadoop HDFS: the distributed storage unit of Hadoop
• Hadoop YARN: the resource-management and job-scheduling layer
• Hadoop MapReduce: the parallel data-processing engine
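The MapReduce model that Hadoop is built on can be sketched in plain, single-process Python; real Hadoop distributes the same map, shuffle, and reduce phases across a cluster:

```python
from collections import defaultdict

# The MapReduce model behind Hadoop, in miniature (word count).
documents = ["big data needs big clusters", "clusters process big data"]

# Map phase: each document independently emits (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each key's values into one result
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(word_counts)
```

Because the map calls are independent and each reduce sees only one key's values, both phases parallelize naturally, which is what lets Hadoop split a workload across many machines.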
28. Key Features of Hadoop
• Cost-effectiveness
• High scalability
• Fault tolerance
• High availability of data
• Faster data processing
• Data locality
• Ability to process all types of data
• Machine learning capabilities
• Integration with other tools
• Security
• Community support
29. 5. PENTAHO
• Pentaho is business intelligence software that provides data integration, OLAP
services, reporting, information dashboards, data mining and extract, transform,
load capabilities. Its headquarters are in Orlando, Florida. Pentaho was acquired
by Hitachi Data Systems in 2015 and in 2017 became part of Hitachi Vantara.
• Pentaho is a business intelligence tool which provides a wide range of business intelligence solutions to its customers. It is capable of reporting, data analysis, data integration, data mining, etc. Pentaho also offers a comprehensive set of BI features that allow businesses to improve performance and efficiency.
31. Features of Pentaho
• ETL capabilities for business intelligence needs
• Pentaho Report Designer for building reports
• Side-by-side subreports
• Query and reporting
• Enhanced functionality and new capabilities
• Full runtime metadata support from data sources
• Professional support and product expertise
32. 6. GRAFANA
• Grafana is an open and composable observability and data
visualization platform. Visualize metrics, logs, and traces from
multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB,
Postgres and many more.
• Grafana itself does not store metrics; it queries time-series databases such as Prometheus and combines their data with a flexible query workflow and a modern alerting approach.
• It is a multi-platform open source analytics and interactive
visualization web application. It provides charts, graphs, and alerts for
the web when connected to supported data sources.
33. Grafana is an open-source data visualization and monitoring tool. Some of its features include:
• Panels: The basic building block for visualization in Grafana.
Panels can contain graphs, tables, heatmaps, and more.
• Plugins: Grafana integrates with many popular data sources.
• Graph annotations: Allows you to mark graphs to enhance
your dataset's correlation.
• Dashboards: Present data in formats like charts, tables,
histograms, heat maps, and world maps.
• Alerts: Allows you to create, manage, and silence alerts within
one UI.
• Authentication: Supports different authentication methods,
such as LDAP and OAuth.
• Logs: Allows you to tail logs in real time, update logs after a
certain time, and view logs for a particular date.
• Reporting: Allows you to automatically generate PDFs from
any of your dashboards.
34. 7. BIPP
• Bipp is a business intelligence (BI) platform that helps organizations use data
to make faster and better decisions. Bipp is a cloud-based platform that allows
users to explore billions of records in real-time. It's built for data analysts and
simplifies SQL queries.
• bipp is a modern cloud business intelligence platform that lets you explore billions of records in real time. Simply connect your data source and build reusable data models with bipp’s Data Modeling Layer, or explore your data with the Visual SQL Data Explorer and create charts and dashboards in minutes.
35. bipp is a cloud-based business intelligence platform that allows users to explore billions of records in real time. Some features of bipp include:
• Data Modeling Layer: Allows users to build reusable
data models.
• Visual SQL Data Explorer: Allows users to explore
data and create charts and dashboards.
• Git: Records changes and manages file versions.
• Interactive dashboards: Can act like data
applications.
• Custom visualizations: Can meet unique needs.
• Real-time performance monitoring: Allows users to
monitor and measure performance.
• Dynamic window/analytic functions
• Views from legacy SQL
• Views using structured SQL
36. 8. CASSANDRA
• Cassandra was created at Facebook by Avinash Lakshman and Prashant Malik to power the inbox search feature, and was open-sourced in 2008. It is used by big companies like Apple, which manages 100 petabytes of data across hundreds of thousands of server instances.
• Cassandra is a NoSQL distributed database that manages large amounts of data across
multiple servers. It's open-source, lightweight, and non-relational. Cassandra is known for
its ability to distribute petabytes of data with high reliability and performance.
• Cassandra is schema-free, supports easy replication, and has a simple API. It's also
eventually consistent and can handle huge amounts of data.
• Cassandra might not be the right database for many-to-many mappings or joins between
tables. It doesn't support a relational schema with foreign keys and join tables.
39. 9. Tableau
• Tableau is a popular data visualization tool that is used by businesses
of all sizes to quickly and easily analyze data. It allows users to create
dashboards and visualizations that can be used to share insights with
stakeholders. Tableau is also used by data scientists to explore data
with limitless visual analytics.
• Tableau is a powerful data visualization tool that helps businesses
derive valuable insights from their data. It allows users to create
interactive dashboards.
41. Features of Tableau
• Tableau Dashboard
• Collaboration and Sharing
• Live and In-memory Data
• Data Sources in Tableau
• Advanced Visualizations
• Mobile View
• Revision History
• Licensing Views
• Subscribe others
42. 10.HPCC
• HPCC Systems, or High-Performance Computing Cluster, is an open source, data-intensive
computing platform for big data processing and analytics. It was developed by LexisNexis Risk
Solutions.
• The HPCC platform incorporates a software architecture implemented on commodity
computing clusters to provide high-performance, data-parallel processing for applications
utilizing big data. The HPCC platform includes system configurations to support both parallel
batch data processing (Thor) and high-performance online query applications using indexed
data files (Roxie). The HPCC platform also includes a data-centric declarative programming
language for parallel data processing called ECL.
• The HPCC system architecture includes two distinct cluster processing environments, Thor and Roxie, each of which can be optimized independently for its parallel data-processing purpose.
43. HPCC Systems features
• Data management and analytics: data profiling, data cleansing, snapshot data updates, and a scheduling component
• Query and search engine: the Roxie cluster contains a powerful query and built-in search engine
• Lightweight core architecture: better performance, near real-time results, and full-spectrum operational scale
• Integrated development environment: the ECL IDE is a Windows desktop application that facilitates ECL code development
• Optimizer: ensures that submitted ECL code is executed at the maximum possible speed for the underlying hardware
• Fast performance
• Easy to deploy and use
• Scales from small to big data
• Rich API for data preparation, integration, quality checking, duplicate checking, etc.
• Parallelized machine learning algorithms for distributed data