2. WHAT IS BIG DATA?
Big data refers to data that is so large and complex that it exceeds the processing
capability of conventional data management systems and software techniques.
Data becomes big data when individual data points stop mattering and only a large
collection of them, or the analysis derived from them, is of value.
Big data offers many opportunities: the advancement of science, improvement of health
care, promotion of economic growth, enhancement of education systems, and new forms
of social interaction and entertainment.
However, big data also raises security and privacy issues due to its huge volume,
high velocity, and large variety of data sources and formats.
3. DIMENSIONS OF BIG DATA
Big Data possesses characteristics that can be defined by several V's:
Volume
Refers to the quantity of data. Big data is defined as massive data sets measured in
petabytes and zettabytes. Vast amounts of data are generated every second; today big
data is produced by machines, networks, and human interaction on systems like social
media. The volume of data to be analysed is massive.
Velocity
Deals with the accelerating speed at which data flows in from sources like business
processes, machines, networks like social media sites, mobile devices, etc. The flow of
data is continuous. Reacting quickly enough to deal with data velocity is a challenge for
most organizations.
Variety
Refers to the various formats of data: structured, numeric data in traditional databases,
and unstructured data such as text documents, email, video, audio, stock ticker data,
and financial transactions.
Veracity
Refers to the quality of big data: biases, noise, abnormality, immeasurable
uncertainties, and the truthfulness and trustworthiness of the data. Data that are
erroneous, duplicated, incomplete, or outdated are collectively referred to as dirty data.
Valence
Refers to the connectedness of big data, represented as graphs, much like the valence of
atoms. Data items are often directly connected to one another: a city is connected to its
country, and two Facebook users are connected if they are friends. High-valence data is denser.
Value
Refers to how big data benefits us and our organization, and helps in measuring the
usefulness of data in decision making. Queries can be run on the stored data to deduce
important results and gain insights.
5. TOOLS FOR BIG DATA
Big Data storage and management tools
Hadoop - Provides a software framework for distributed storage and processing of big
data using the MapReduce programming model.
Cassandra - Used for fast processing in environments with very heavy reads and writes,
and for storing data too large to fit on a single server, while still offering a friendly,
familiar interface.
MongoDB - Used for dynamic queries and for defining indexes for good performance on a
big database, which makes applications faster and more efficient at scale.
Apache Hive - Used for analysis of large datasets stored in HDFS, as well as for data
summarization, querying, and ad-hoc analysis of structured and semi-structured data
in Hadoop.
HBase - Used for real-time big data applications whose tables contain billions of rows
and millions of columns, built for low-latency operations.
Cloudera- 100% open source and is the only Hadoop solution to offer batch processing,
interactive SQL and interactive search as well as enterprise-grade continuous availability.
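The MapReduce model that Hadoop implements can be illustrated with a minimal in-memory sketch. This simulates the map, shuffle, and reduce phases in plain Python for a word count; a real Hadoop job would distribute these phases across a cluster, and the function names here are illustrative:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data flows fast"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Because each phase only sees independent key groups, the same logic scales out to many nodes, which is the core idea behind Hadoop's distributed processing.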
6. TYPICAL BIG DATA ARCHITECTURE
Big data architecture varies based on a company's infrastructure and needs, but it usually
contains the following components:
1. Data sources: This can include data from databases, data from real-time sources, and
static files generated from applications, such as Windows logs.
2. Data store: Storage is needed for the data that will be processed via the big data
   architecture. Often, data is stored in a data lake: a large repository for raw, often
   unstructured data that scales easily.
3. A combination of batch processing and real-time processing: Large volumes of data
   can be handled efficiently using batch processing, while real-time data needs to be
   processed immediately to bring value.
4. Analytical data store: Helps keep all the data in one place so analysis can be
   comprehensive, and it is optimized for analysis rather than transactions. This might
   take the form of a cloud-based data warehouse or a relational database.
5. Automation: Ingesting and transforming the data, moving it in batch and stream
   processes, loading it into an analytical data store, and finally deriving insights must
   form a repeatable workflow so that you can continually gain insights from your big data.
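The components above can be sketched as one repeatable workflow. This is a deliberately minimal in-memory illustration, not a real framework API; the function and variable names are assumptions made for the example:

```python
def ingest(sources):
    # Steps 1-2: collect raw records from each source into a "data lake" list.
    lake = []
    for source in sources:
        lake.extend(source)
    return lake

def batch_process(lake):
    # Step 3: a batch step that drops empty records and normalizes the rest.
    return [r.strip().lower() for r in lake if r.strip()]

def load(records, store):
    # Step 4: load the cleaned records into the analytical data store.
    store.extend(records)
    return store

def run_pipeline(sources, store):
    # Step 5: automation -- the whole flow as one repeatable function.
    return load(batch_process(ingest(sources)), store)

store = []
run_pipeline([["  Web Log  ", ""], ["Sensor Reading"]], store)
```

In a production architecture, each function would be replaced by a dedicated system (ingestion service, batch engine, warehouse loader), but the repeatable end-to-end shape is the same.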
7. GENERAL BIG DATA SECURITY ISSUES
Insecure Computation
Malicious programs are used by attackers to extract sensitive information from data
sources. They can also corrupt the data, leading to incorrect results in prediction or
analysis, and can result in Denial of Service (DoS).
Input Validation and Filtering
Big data collects inputs from multiple sources, so input validation is required. This
involves validating trusted data sources and filtering malicious data from the good.
In big data, the continuous flow of gigabytes and terabytes of data makes it very
difficult to perform input validation or filtering on an incoming batch.
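A minimal sketch of what such validation and filtering might look like for incoming records; the trusted source names and value thresholds are illustrative assumptions, not part of any real system:

```python
def is_valid(record):
    # Accept only records from trusted sources with plausible numeric values
    # (source names and range limits here are illustrative).
    trusted_sources = {"sensor-a", "sensor-b"}
    return (
        record.get("source") in trusted_sources
        and isinstance(record.get("value"), (int, float))
        and -1000 <= record["value"] <= 1000
    )

def filter_batch(batch):
    # Drop malicious or malformed records before they enter the pipeline.
    return [r for r in batch if is_valid(r)]

batch = [
    {"source": "sensor-a", "value": 21.5},
    {"source": "unknown", "value": 9},              # untrusted source
    {"source": "sensor-b", "value": "DROP TABLE"},  # malformed value
]
clean = filter_batch(batch)
```

At big data scale, the challenge is running checks like these fast enough on a continuous stream rather than the checks themselves.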
Privacy Concerns in Data Mining and Analytics
Monetization of big data involves sharing analytical results, which brings multiple
challenges such as invasion of privacy, invasive marketing, and unintentional disclosure
of information. One well-known example: AOL Inc. released search logs in which users
could easily be identified, which raised serious privacy concerns.
Granular Access Controls
Big data systems were traditionally designed with almost no security in mind. As a
workaround, the parts of data sets that users have the right to see are copied to a
separate big data warehouse and provided to particular user groups. For medical
research, for instance, only the medical information (without names and addresses) gets
copied. Volumes of big data grow even faster this way, and such complex solutions
adversely affect the system's performance and maintenance.
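This field-level copying amounts to a projection step over the records. A minimal sketch, where the group name and field names are hypothetical examples rather than any real schema:

```python
# Which fields each user group is entitled to see (illustrative assumption).
ALLOWED_FIELDS = {"medical_research": {"diagnosis", "age", "treatment"}}

def project_for_group(records, group):
    # Copy only the permitted fields, stripping identifiers such as
    # names and addresses before handing data to the group.
    allowed = ALLOWED_FIELDS[group]
    return [{k: v for k, v in r.items() if k in allowed} for r in records]

patients = [
    {"name": "A. Smith", "address": "12 Elm St", "diagnosis": "flu", "age": 34},
]
subset = project_for_group(patients, "medical_research")
```

The downside noted above is visible even here: every group gets its own copy of the data, so storage grows with the number of access profiles.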
Insecure data storage
Authentication, authorization, and encryption of data across thousands of nodes is
challenging. Auto-tiering moves cold data, which might still be of use, to less secure
media, and encryption of real-time data may have performance impacts. Secure
communication among the various nodes, middleware, and end users is disabled by
default, so it needs to be enabled explicitly.
9. SECURITY ISSUES IN BIG DATA – SOME
RELEVANT USE CASES
Vulnerability to fake data generation
For instance, if a manufacturing company uses sensor data to detect malfunctioning
production processes, cybercriminals can penetrate the system and make the sensors
show fake results. The company can fail to notice alarming trends and miss the
opportunity to solve problems before serious damage is caused. Such challenges can be
addressed by applying fraud detection approaches.
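One simple fraud detection approach for the sensor scenario is statistical outlier flagging. A minimal sketch using a z-score style test; the readings and threshold are made-up illustrative values, and real systems would use far more sophisticated models:

```python
import statistics

def flag_anomalies(readings, threshold=3.0):
    # Flag readings that deviate more than `threshold` standard deviations
    # from the mean -- a possible sign of injected fake sensor data.
    mean = statistics.mean(readings)
    stdev = statistics.stdev(readings)
    return [x for x in readings if abs(x - mean) > threshold * stdev]

readings = [20.1, 20.3, 19.9, 20.2, 20.0, 55.0]  # last value looks injected
suspect = flag_anomalies(readings, threshold=2.0)
```

Flagged values would then be quarantined or cross-checked against other sensors before they influence decisions about the production process.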
Amazon’s Galaxy Data Lakes
Challenges faced by Amazon: data silos, difficulty analyzing diverse datasets, managing
data access and security.
1. A data silo is a situation in which only one group in an organization can access a set
   of data. With international expansion, data is stored in different places and in
   different ways, which keeps important data hidden. A data lake solves this problem by
   uniting all the data in one central location.
2. Amazon Prime has data for fulfilment centres and packaged goods, while Amazon
Fresh has data for grocery stores and food. Even shipping programs differ
internationally. For example, different countries sometimes have different box sizes
and shapes. Different systems may also have the same type of information, but it’s
labeled differently. For example, in Europe, the term used is “cost per unit,” but in
North America, the term used is “cost per package.”
   Data lakes allow you to import any amount of data in any format because there is no
   predefined schema.
3. Amazon's operations finance data are spread across more than 25 databases, with
   regional teams creating their own local versions of datasets. Audits and controls must
   be in place for each database to ensure that nobody has improper access. With a data
   lake, it's easier to get the right data to the right people at the right time.
Possibility of sensitive information mining
Lack of control within big data solutions may let corrupt IT specialists or
business rivals mine unprotected data and sell it for their own benefit.
Companies can incur huge losses if such information is connected with a new
product or service launch, or with users' personal information. An employee
in charge of a company's big data store can misuse his power and violate
privacy policies, for example by stalking people through monitoring their
chats. To avoid this, proper security tools should be in place and access
controls should be strictly applied at different levels in the organization.
High speed of NoSQL databases' evolution and lack of security focus
NoSQL databases handle many challenges of big data analytics without much concern for
security: security is embedded only in the middleware, and no explicit security
enforcement is provided. NoSQL databases have weak authentication techniques and
weak password storage mechanisms. They are subject to attacks such as JSON injection,
REST injection, man-in-the-middle attacks, and schema injection, and, due to lenient
security mechanisms, to insider attacks as well. To avoid this, the following should
be done:
1. Encrypting sensitive database fields
2. Keeping unencrypted values in a sandboxed environment
3. Using sufficient input validation
4. Applying strong user authentication policies
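Point 4 can be illustrated with a minimal sketch of salted password hashing using PBKDF2 from Python's standard library, which addresses the weak password storage mechanisms noted above. The iteration count is an illustrative parameter:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=200_000):
    # Derive a salted PBKDF2-HMAC-SHA256 digest; never store the raw password.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest

def verify_password(password, salt, digest, iterations=200_000):
    # Recompute the digest and compare in constant time.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("s3cret")
```

Storing only the salt and digest means that even if the credential store leaks, attackers cannot directly recover user passwords.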
13. RECOMMENDATIONS TO ENHANCE BIG
DATA SECURITY
Secure Your Computation Code
To prevent malicious data entry, implement access control, code signing, and dynamic
analysis of the computational code. Proper strategies need to be in place to contain the
impact of untrusted code if it manages to get into the big data solution.
There are generally two ways of preventing attacks: securing the data when an insecure
mapper is present, and securing the mapper itself.
Implement Comprehensive Input Validation and Filtering.
For better security, implementing input validation and filtering on both internal and
external sources is recommended, along with proper evaluation of key input validation
and filtering features.
Implement Granular Access Control.
Defining and enforcing roles for the different kinds of users (admins,
knowledge workers, end users, developers, etc.) is the core of implementing
granular access control.
Use policy to define which SUDO sessions are keystroke-logged based on risk
and user, implement granular assignments for who can switch sessions ("SU"),
and audit privileged activity.
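The role definitions described above reduce to a permission-check function at enforcement time. A minimal sketch, where the role names follow the user kinds mentioned in the text and the permission names are illustrative assumptions:

```python
# Role-to-permission mapping (permission names are illustrative).
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "configure", "audit"},
    "knowledge_worker": {"read", "query"},
    "end_user": {"read"},
    "developer": {"read", "write"},
}

def is_allowed(role, action):
    # Granular access control: permit an action only if the user's
    # role explicitly grants it; unknown roles get nothing.
    return action in ROLE_PERMISSIONS.get(role, set())
```

Denying by default for unknown roles, as the last line does, is the safe choice: a misconfigured account loses access rather than silently gaining it.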
Secure data storage and computation.
This is important because much sensitive data leakage occurs in this phase.
Sensitive data should therefore be segregated, and enabling data encryption
for sensitive data and auditing administrative access on data nodes are major
steps. Finally, verifying the proper configuration of API security for all
components completes secure data storage and computation.
15. CONCLUSION
Big data is trending. Hardly any new application can be imagined that does not
produce new forms of data, operate on data-driven algorithms, and consume
significant amounts of data.
With data storage and computing environments becoming cheaper, encryption and
compliance have introduced challenges that need to be handled in a very
systematic manner.
A big ecosystem exists for specific big data problems. The major
recommendations for dealing with the security issues are the implementation of
data lakes, access controls, input validation and filtering, and the securing
of data storage and computation.