This document summarizes a research paper on big data and Hadoop. It begins by defining big data and explaining how the volume, variety and velocity of data make it difficult to process using traditional methods. It then discusses Hadoop, open source software used to analyze large datasets across clusters of computers. Hadoop uses HDFS for storage and MapReduce as a programming model to distribute processing. The document outlines some of the key challenges of big data, including privacy, security, data access and analytical challenges.
Research Paper on Big Data and Hadoop
Solanki Raviraj Gulabsinh
Assistant Professor
Shree M.L.Kakdiya MCA Mahila College, Amreli

Abstract
'Big Data' describes the techniques and technologies used to store, distribute, manage and analyze large, high-velocity datasets. Big data can be structured, unstructured or semi-structured, which makes conventional data management methods inadequate. Data is generated from many different sources and can arrive in the system at varying rates. To process these large amounts of data in an inexpensive and efficient way, parallelism is used. Hadoop is the core platform for structuring Big Data, and it solves the problem of making that data useful for analytics. Hadoop is an open source software project that enables the distributed processing of large data sets with a very high degree of fault tolerance.
I. Introduction
A. Big Data
Big Data refers to the efficient handling of amounts of data that cannot be managed using traditional or conventional methods such as relational databases; equivalently, it is the set of techniques required to handle the large volumes of data generated by advances in technology and growth in population. Big data techniques help to store, retrieve and modify these large data sets. For example, with the advent of smart technology there has been a rapid increase in the use of mobile phones, and the large amount of data they generate every second is impossible to handle using traditional methods; big data concepts were introduced to overcome this problem. Most analysts and practitioners currently refer to data sets ranging from 30-50 terabytes (10^12 bytes, or 1,000 gigabytes, per terabyte) up to multiple petabytes (10^15 bytes, or 1,000 terabytes, per petabyte) as big data.
II. 3 V’s of Big Data
A. Data Volume
Data volume refers to the amount of data. At present the volume of data stored has grown from megabytes and gigabytes to petabytes, and it is expected to reach zettabytes in the near future.
B. Data Variety
Variety refers to the different types of data (text, images, video, audio, etc.) and the different sources of data. The data being produced is not of a single category: it includes not only traditional structured data but also semi-structured data from sources such as web pages, web log files, social media sites, e-mail and documents.
C. Data Velocity
Velocity in Big data deals with the speed of the data coming from various sources. This characteristic is not limited to the speed of incoming data; it also covers the speed at which the data flows and is aggregated.
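The three V's can be seen together in a toy example: semi-structured records (variety) arrive one by one from a stream (velocity) and are aggregated on the fly rather than being stored first (volume). This is only an illustrative sketch; the event fields and sources are invented for the example, not taken from the paper.

```python
import json
from collections import Counter

# Hypothetical semi-structured events arriving as a stream; in practice
# these would come from web logs, social media feeds, sensors, etc.
events = [
    '{"source": "web_log", "bytes": 512}',
    '{"source": "social", "bytes": 2048}',
    '{"source": "web_log", "bytes": 1024}',
]

volume_by_source = Counter()       # running aggregate, updated per event
for raw in events:
    record = json.loads(raw)       # parse each semi-structured record
    volume_by_source[record["source"]] += record["bytes"]

print(dict(volume_by_source))      # {'web_log': 1536, 'social': 2048}
```

The point of aggregating as the data flows is that the raw events never need to be held in memory or in a relational table at once, which is exactly where traditional engines struggle.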
III. Problems and Challenges Associated with Big Data Processing
The challenges in Big Data are real hurdles that require immediate attention. Any implementation that does not handle these challenges may lead to the failure of the technology deployment and to unpleasant results.
A. Size
The first thing anyone thinks of with Big Data is its size; the word "big" is in the very name. Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. In the past, this challenge was mitigated by processors getting faster, following Moore's law, providing the resources needed to cope with increasing volumes of data. But there is a fundamental shift underway now: data volume is scaling faster than compute resources, and CPU speeds have become static.
B. Privacy and Security
Privacy and security are among the most important challenges of Big data, because the data involved is sensitive.
• Personal information (e.g. from the database of a social networking website), when combined with external large data sets, can lead to the inference of new facts about a person. Such facts may be secretive, and the person might not want the data owner, or anyone else, to know them.
• Information about people is collected and used in order to add value to the business of an organization. Another important consequence arises on social sites: some people take advantage of Big data predictive analysis while, on the other hand, the underprivileged can be easily identified and treated worse.
• Big Data increases the chance that certain tagged people will suffer adverse consequences without the ability to fight back, or even the knowledge that they are being discriminated against.
C. Data Access and Sharing of Information
Because of the huge amount of data, the data management and governance process is complex, adding the necessity to make data open and available to government agencies in a standardized manner, with standardized APIs, metadata and formats. Expecting companies to share data is awkward because of their need to keep an edge in business: sharing data about their clients and operations threatens the culture of secrecy and competitiveness.
D. Analytical Challenges
Big data brings with it some huge analytical challenges, since the analysis must be performed on a huge amount of data that can be unstructured, semi-structured or structured. The main analytical questions are:
• What if the data volume gets so large and varied that it is not known how to deal with it?
o Does all the data need to be stored?
o Does all the data need to be analyzed?
o How can we find out which data points are really important?
o How can the data be used to best advantage?
E. Human Resources and Manpower
Since Big data is an emerging technology, it needs to attract organizations and young people with diverse new skill sets. These skills should not be limited to technical ones but should also extend to research, analytical, interpretive and creative ones. Such skills need to be developed in individuals, which requires training programs to be held by organizations. Moreover, universities need to introduce curricula on Big data to produce skilled employees in this area.
F. Technical Challenges
1. Fault Tolerance
With the arrival of new technologies like cloud computing and Big data, it is always intended that whenever a failure occurs the damage done should be acceptable. Fault-tolerant computing is extremely hard, involving intricate algorithms, so the main task is to reduce the probability of failure to an "acceptable" level. Two methods which help increase fault tolerance in Big data are:
• First, divide the whole computation into tasks and assign these tasks to different nodes for computation.
• Second, assign one node the work of observing that these nodes are working properly; if something happens, that particular task is restarted.
Sometimes, however, the whole computation cannot be divided into such independent tasks. Some tasks may be recursive in nature, where the output of the previous computation is the input to the next one, so restarting the whole computation becomes a cumbersome process. This can be avoided by applying checkpoints, which record the state of the system at certain intervals of time. In case of any failure, the computation can restart from the last checkpoint maintained.
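The checkpoint idea above can be sketched in a few lines. This is a minimal illustration, not any framework's actual recovery code: the computation is a chain of dependent tasks, the state after each completed task is treated as a checkpoint, and an injected failure forces a restart from the last checkpoint rather than from the beginning. The task list and failure injection are assumptions made for the example.

```python
def run_with_checkpoints(tasks, failures):
    """tasks: functions state -> state, each consuming the previous
    task's output; failures: indices of tasks that fail exactly once."""
    state, i = 0, 0       # last checkpointed state and next task index
    restarts = 0
    while i < len(tasks):
        try:
            if i in failures:
                failures.discard(i)       # fail only on the first attempt
                raise RuntimeError(f"node running task {i} failed")
            state = tasks[i](state)       # run task on checkpointed state
            i += 1                        # checkpoint: task i is now durable
        except RuntimeError:
            restarts += 1                 # resume from the last checkpoint
    return state, restarts

# Three dependent (recursive-style) tasks; task 1 fails once.
tasks = [lambda s: s + 1, lambda s: s * 10, lambda s: s + 5]
result, restarts = run_with_checkpoints(tasks, {1})
print(result, restarts)   # 15 1
```

Without the checkpoint (i.e. restarting from `state = 0`), the failed run would have to redo task 0 as well; with it, only the failed task is retried.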
2. Scalability
The scalability issues of Big data have led towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals into very large clusters. This requires a high level of resource sharing, which is expensive and also brings various challenges, such as how to run and schedule the various jobs so that the goal of each workload is met cost-effectively. It also requires dealing with system failures in an efficient manner, as they occur more frequently when operating on large clusters. These factors combined raise the concern of how to express programs, even complex machine learning tasks. There has also been a huge shift in the storage technologies being used: Hard Disk Drives (HDDs) are being replaced by solid state drives and phase-change memory technology, which do not have the same performance characteristics for sequential versus random data transfer. Thus, which kinds of storage devices to use is again a big question for data storage.
3. Quality of Data
Big Data basically focuses on storing quality
data, rather than very large amounts of
irrelevant data, so that better results and
conclusions can be drawn. This leads to further
questions: how to ensure which data is relevant,
how much data would be enough for decision
making, and whether the stored data is accurate
enough to draw conclusions from.
4. Heterogeneous Data
Unstructured data represents almost every kind
of data being produced like social media
interactions, to recorded meetings, to handling
of PDF documents, fax transfers, to emails and
more. Working with unstructured data is a
cumbersome problem and of course costly too.
Converting all this unstructured data into
structured one is also not feasible.Structured
data is always organized into highly
mechanized and manageable way. It shows well
integration with database but unstructured data
is completely raw and unorganized.
IV. Hadoop: Solution for Big Data
Processing
Hadoop is open source software used to process
Big Data. It is widely used by organizations and
researchers to analyze Big Data. Hadoop is
influenced by Google's architecture, the Google
File System and MapReduce. Hadoop processes
large data sets in a distributed computing
environment. The Apache Hadoop ecosystem
consists of the Hadoop kernel, MapReduce,
HDFS and other components such as Apache
Hive, HBase and ZooKeeper.
A. Hadoop consists of two main
components:
1) Storage: The Hadoop Distributed File
System (HDFS): a distributed file system that
provides fault tolerance and is designed to run
on commodity hardware. HDFS provides
high-throughput access to application data and
is suitable for applications that have large data
sets. HDFS can store data across thousands of
servers and has a master/slave architecture.
Files added to HDFS are split into fixed-size
blocks. The block size is configurable, but
defaults to 64 megabytes.
Fig. 2. HDFS Blocks.
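The block-splitting step can be illustrated with a short sketch. This is not the actual HDFS client code; it simply mimics how an incoming file is cut into fixed-size blocks, with the 64 MB default scaled down to 64 bytes for the example.

```python
def split_into_blocks(data: bytes, block_size: int = 64 * 1024 * 1024):
    # Cut an incoming byte stream into fixed-size blocks, as HDFS does
    # on ingest; the last block may be smaller than block_size.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 200-byte "file" with a 64-byte block size yields 4 blocks: 64 + 64 + 64 + 8.
blocks = split_into_blocks(b"x" * 200, block_size=64)
```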
2) Processing: MapReduce: a programming
model introduced by Google in 2004 for easily
writing applications that process large amounts
of data in parallel on large clusters of hardware
in a fault-tolerant manner. It operates on a huge
data set, splits the problem and data, and runs
them in parallel. The two functions in
MapReduce are as follows:
a) Map – takes key/value pairs as input
and generates an intermediate set of
key/value pairs.
b) Reduce – merges all the intermediate
values associated with the same
intermediate key.
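The two functions can be made concrete with the classic word-count example, sketched here in plain Python rather than as a real Hadoop job; the shuffle step that groups intermediate pairs by key is simulated with a dictionary.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values associated with the same key.
    return (word, sum(counts))

def run_job(documents):
    # Simulated shuffle: group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())
```

In a real cluster, the map tasks and reduce tasks run on different nodes and the framework handles the grouping and fault tolerance.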
B. HDFS Architecture
Hadoop includes a fault-tolerant storage system
called the Hadoop Distributed File System or
HDFS. HDFS is able to store huge amounts of
information, scale up incrementally and survive
the failure of significant parts of the storage
infrastructure without losing data. Hadoop
creates clusters of machines and coordinates
work among them. Clusters can be built with
inexpensive computers. If one fails, Hadoop
continues to operate the cluster without losing
data or interrupting work, by shifting work to
the remaining machines in the cluster. HDFS
manages storage on the cluster by breaking
incoming files into pieces, called “blocks,” and
storing each of the blocks redundantly across
the pool of servers. In the common case, HDFS
stores three complete copies of each file by
copying each block to three different servers.
HDFS ARCHITECTURE
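The redundant block placement described above can be sketched as follows. HDFS's real placement policy is rack-aware and more sophisticated; this round-robin version, with illustrative server names, only shows the idea of writing each block to several distinct servers.

```python
def place_blocks(num_blocks, servers, replicas=3):
    # Assign each block to `replicas` distinct servers, round-robin style,
    # so the loss of any single server leaves other copies intact.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [servers[(b + r) % len(servers)] for r in range(replicas)]
    return placement
```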
C. Map Reduce Architecture
The processing pillar in the Hadoop ecosystem
is the MapReduce framework. The framework
allows the specification of an operation to be
applied to a huge data set, dividing the problem
and data and running them in parallel. From an
analyst's point of view, this can occur on
multiple dimensions. For example, a very large
dataset can be reduced into a smaller subset to
which analytics can be applied. In a traditional
data warehousing scenario, this might entail
applying an ETL operation on the data to
produce something usable by the analyst. In
Hadoop, these kinds of operations are written as
MapReduce jobs in Java. There are a number of
higher-level languages, like Hive and Pig, that
make writing these programs easier. The
outputs of these jobs can be written back either
to HDFS or to a traditional data warehouse.
MapReduce Architecture
D. Big Data Advantages and Good Practices
Big Data has numerous advantages for society,
science and technology. Some of the
advantages (Marr, 2013) are described below:
1. Understanding and Targeting Customers
This is one of the biggest and most publicized
areas of big data use today. Here, big data is
used to better understand customers and their
behaviors and preferences. Companies are keen
to expand their traditional data sets with social
media data, browser logs as well as text
analytics and sensor data to get a more
complete picture of their customers. The big
objective, in many cases, is to create predictive
models.
2. Understanding and Optimizing
Business Processes
Big data is also increasingly used to optimize
business processes. Retailers are able to
optimize their stock based on predictions
generated from social media data, web search
trends and weather forecasts. HR business
processes are also being improved using big
data analytics.
3. Improving Science and Research
Science and research are currently being
transformed by the new possibilities big data
brings. Take, for example, CERN, the European
particle physics laboratory, with its Large
Hadron Collider, the world's largest and most
powerful particle accelerator. Experiments to
unlock the secrets of our universe, how it
started and how it works, generate huge
amounts of data. The CERN data centre has
65,000 processors to analyze its 30 petabytes of
data. However, it also uses the computing
power of thousands of computers distributed
across 150 data centres worldwide to analyze
the data. Such computing power can be
leveraged to transform many other areas of
science and research.
4. Improving Healthcare and Public Health
The computing power of big data analytics
enables us to decode entire DNA strings in
minutes and will allow us to find new cures and
better understand and predict disease patterns.
5. Optimizing Machine and Device
Performance
Big data analytics helps machines and devices
become smarter and more autonomous. For
example, big data tools were used to operate
Google's self-driving car: a Toyota Prius fitted
with cameras, GPS, powerful computers and
sensors that let it drive safely on the road
without human intervention.
6. Financial Trading
High Frequency Trading (HFT) is an area
where big data finds a lot of use today. Here,
big data algorithms are used to make trading
decisions. The majority of equity trading now
takes place via algorithms that increasingly take
into account signals from social media networks
and news websites to make buy and sell
decisions in split seconds.
7. Improving Security and Law Enforcement
Big data is applied heavily in improving
security and enabling law enforcement. It has
been revealed that the National Security
Agency (NSA) in the U.S. uses big data
analytics to foil terrorist plots (and perhaps spy
on us). Others use big data techniques to detect
and prevent cyber-attacks. Police forces use big
data tools to catch criminals and even predict
criminal activity, and credit card companies use
it to detect fraudulent transactions.
Other advantages and uses include improving
sports performance, improving and optimizing
cities and countries, and personal quantification
and performance optimization.
V. Conclusion
We have entered an era of Big Data. The paper
describes the concept of Big Data along with its
3 Vs: Volume, Velocity and Variety.
The paper also focuses on Big Data processing
problems. These technical challenges must be
addressed for efficient and fast processing of
Big Data. The challenges include not just the
obvious issues of scale, but also heterogeneity,
lack of structure, error-handling, privacy,
timeliness, provenance, and visualization, at all
stages of the analysis pipeline from data
acquisition to result interpretation. These
technical challenges are common across a large
variety of application domains, and therefore
not cost-effective to address in the context of
one domain alone. The paper also describes
Hadoop, an open source software framework
used for processing Big Data.
References
[1] S. Vikram Phaneendra, E. Madhusudhan
Reddy, "Big Data solutions for RDBMS
problems - A survey", in 12th IEEE/IFIP
Network Operations & Management
Symposium (NOMS 2010), Osaka, Japan, Apr.
19-23, 2013.
[2] Aveksa Inc. (2013). Ensuring “Big Data”
Security with Identity and Access Management.
Waltham, MA: Aveksa.
[3] Hewlett-Packard Development Company.
(2012). Big Security for Big Data. L.P.:
Hewlett-Packard Development Company.
[4] Kaisler, S., Armour, F., Espinosa, J. A.,
Money, W. (2013). Big Data: Issues and
Challenges Moving Forward. International
Conference on System Sciences (pp. 995-1004).
Hawaii: IEEE Computer Society.
[5] Katal, A., Wazid, M., Goudar, R. H. (2013).
Big Data: Issues, Challenges, Tools and Good
Practices. IEEE, 404-409.
[6] Marr, B. (2013, November 13). The
Awesome Ways Big Data is Used Today to
Change Our World. Retrieved November 14,
2013, from LinkedIn:
https://www.linkedin.com/today/post/article/20131113065157-64875646-the-awesome-ways-bigdata-is-used-today-tochange-our-world.