Big Data is the term for a collection of data sets so
large and complex that it becomes difficult to
process using on-hand database management tools
or traditional data processing applications.
Big data can be characterized in terms of volume, velocity,
variety, and value. The huge increase in the volume of data has
already been mentioned. Terabytes of data are generated daily.
This data travels with increasing velocity.
Data is generated in real time, and real-time data analysis is
required. In addition, more varieties of data are generated (for
example, from sources such as social media and equipment sensors).
With an enormous amount of unstructured data, it is harder to
derive value from the data. Important information can be
hidden among irrelevant data.
The biggest challenge is to identify valuable data and then to
modify and extract that data so that you can analyze it.
Facebook, Yahoo, and Google found themselves
collecting data on an unprecedented scale, referred to
as Big Data. These big data collections quickly
overwhelmed traditional data systems and techniques.
In the early 2000s, armies of PhDs developed new
techniques such as BigTable, MapReduce, and the Google
File System to handle this big data.
Today, companies in every industry face big data
problems brought on by their ever-increasing ability
to collect information.
APPLICATIONS OF BIG DATA
Application of big data in business
Big Data Exploration
Find, visualize, understand all big data to improve
decision making. Big data exploration addresses the
challenge that every large organization faces: information
is stored in many different systems and silos and people
need access to that data to do their day-to-day work and
make important decisions.
Enhanced 360º View of the Customer
Extend existing customer views by incorporating additional
internal and external information sources. Gain a full
understanding of customers—what makes them tick, why they
buy, how they prefer to shop, why they switch, what they’ll buy
next, and what factors lead them to recommend a company to
others.
Security Intelligence Extension
Lower risk, detect fraud, and monitor cyber security in real time.
Augment and enhance cyber security and intelligence analysis
platforms with big data technologies to process and analyze new types
(e.g. social media, emails, sensors, telco) and sources of under-
leveraged data to significantly improve intelligence, security, and law
enforcement.
Operations Analysis
Analyze a variety of machine and operational data for improved business
results. The abundance and growth of machine data, which can include
anything from IT machines to sensors, meters, and GPS devices, requires
complex analysis and correlation across different types of data sets. By using
big data for operations analysis, organizations can gain real-time visibility into
operations, customer experience, transactions, and behavior.
Data Warehouse Modernization
Integrate big data and data warehouse capabilities to
increase operational efficiency. Optimize your data
warehouse to enable new types of analysis. Use big
data technologies to set up a staging area or landing
zone for your new data before determining what data
should be moved to the data warehouse. Offload
infrequently accessed or aged data from warehouse
and application databases using information
integration software and tools.
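The offload step above can be sketched in miniature with SQLite; the `orders` table, its schema, and the retention cutoff are all hypothetical stand-ins for a real warehouse:

```python
import sqlite3

# In-memory database standing in for a warehouse (hypothetical schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, order_year INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 2010), (2, 2013), (3, 2014)])

CUTOFF = 2013  # rows from before this year count as "aged"

# Copy aged rows to an archive table (the landing zone), then remove
# them from the warehouse table.
con.execute("CREATE TABLE archive (id INTEGER, order_year INTEGER)")
con.execute("INSERT INTO archive SELECT * FROM orders WHERE order_year < ?",
            (CUTOFF,))
con.execute("DELETE FROM orders WHERE order_year < ?", (CUTOFF,))

live = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
aged = con.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
```

In a real deployment the archive would live in a big data staging zone rather than the same database; the pattern of copy-then-delete by age is the same.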
Application of big data in healthcare
The chief appeal of Big Data in healthcare lies in two distinct areas.
First, the sifting of vast amounts of data to discover trends and patterns
within them that help direct the course of treatments, generate new
research, and focus on causes that were thus far unclear. In other
words, the ‘size of the lens’ that has been used to view data results has
just become that much wider. Secondly, the sheer volume of data that
can be processed using Big Data techniques is an enabler for fields such
as drug discovery and molecular medicine.
Big Data solutions in healthcare are typically seen in advanced
analytics such as personalized medicine, drug development, and
epidemiology, which require massive amounts of data and complex data
mining algorithms; real-time applications calling for complex event
processing (e.g., patient monitoring, proactive risk management),
which requires analysis of numerous real-time data streams; and in
unstructured data mining such as keeping practitioners abreast of
medical literature more effectively and efficiently, and uncovering
patterns in text, images, audio, and video.
Applications of big data in health
Big Data can enable new types of applications, which
in the past might not have been feasible due to
scalability or cost constraints. In the past, scalability in
many cases was limited due to symmetric
multiprocessing (SMP) environments, where a single
machine can only be scaled up so much when adding
more processors, memory, or disk. On the other hand,
massively parallel processing (MPP) enables nearly limitless
scalability. Many NoSQL Big Data platforms, such as Hadoop
and Cassandra, are open source software that can run on
commodity hardware, thus driving down hardware and software
costs.
Application of big data in research
As we move toward an era where the digitisation of information is
increasingly on demand, the overall amount of data grows exponentially.
The research area of Big Data has emerged precisely to tackle the vast
amounts of data generated, as well as to investigate the strong societal
impacts incurred by the explosion of data in society.
Supporting Big Data Analytics
Big Data as a concept refers to data that is so large in volume, moving at an
unforeseen velocity, with such high variation in structure, and very often of
uncertain veracity, that the development of new techniques and systems is
required to fully exploit and explore it and to derive its value.
The figure to the left depicts a generic architecture for how data from different
sources, including social media, medical records, DNA sequences and
communication data are typically dealt with. The architecture consists of three
main steps: data collection, processing and analysis, as well as value extraction.
The ultimate research goal of our team is to develop a computational and
statistical framework that can help us extract value from vast amounts of data.
CHALLENGES OF BIG DATA
Volume: the main challenge is how to deal with the sheer size of big data.
Variety: combining multiple data sets: the challenge is how to
handle multiplicity of types, sources and formats.
Velocity: one of the key challenges is how to react to the flood of
information in the time required by the application.
Veracity: data quality, data availability: How can we cope with
uncertainty, imprecision, missing values, misstatements or
untruths? How good is the data? How broad is the coverage?
How fine is the sampling resolution? How timely are the
readings? How well understood are the sampling biases? Is there
data available, at all?
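A first-pass veracity check can be as simple as measuring how much of a data set is actually present; the sensor readings below are made-up values for illustration:

```python
# Hypothetical sensor readings; None marks a missing value.
readings = [21.5, None, 22.0, 21.8, None, 23.1]

missing = sum(1 for r in readings if r is None)
missing_rate = missing / len(readings)

# Coverage: the share of usable readings.
coverage = 1 - missing_rate
```

Real pipelines extend this with range checks, timeliness checks, and bias estimates, but the principle is the same: quantify data quality before trusting the analysis.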
Data discovery: this is a huge challenge: how to find high-
quality data from the vast collections of data that are out
there on the Web?
Quality and relevance: the challenge is determining the
quality of data sets and their relevance to particular issues (i.e. is
the data set making some underlying assumption that
renders it biased or not informative for a particular issue)?
Data comprehensiveness: are there areas without coverage?
What are the implications?
Personally identifiable information: Can we extract enough
information to help people without extracting so much as
to compromise their privacy?
A major challenge in this context is how to analyse data.
Process challenges in regard to deriving insights include:
Aligning data from different sources (e.g., resolving when
two objects are the same)
Transforming the data into a form suitable for analysis
Modelling it, whether mathematically, or through some
form of simulation
Understanding the output, visualizing and sharing the
results, and considering how best to display complex analytics.
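The first step, resolving when two objects are the same, is often approached by normalizing records into a canonical form before comparing them. A minimal sketch (the name formats here are invented for illustration):

```python
def canonical(name: str) -> str:
    # Lowercase, drop punctuation, and sort the tokens so that
    # "Smith, John" and "john SMITH" map to the same key.
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(sorted(cleaned.split()))

# Records from two different sources referring to the same person.
same = canonical("Smith, John") == canonical("john  SMITH")
```

Production entity resolution adds fuzzy matching and blocking, but canonicalization of keys is usually the starting point.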
The main management challenges are related to data
privacy, security, governance, and ethical issues.
The main management-related challenges are ensuring that
data is used correctly (abiding by its intended
uses and relevant laws), tracking how the data is used,
transformed, and derived, and managing its lifecycle.
According to Michael Blaha, “Many data warehouses
contain sensitive data such as personal data. There are legal
and ethical concerns with accessing such data. So the data
must be secured and access controlled, as well as logged for
audits.”
HADOOP
Hadoop is a software framework that supports distributed
applications, licensed under the Apache v2 licence.
Hadoop was derived from Google's MapReduce and
Google File System papers.
Yahoo is the largest contributor to the project.
It is written in the Java programming language.
Hadoop is based on a file system and is not a database.
• Hadoop is a distributed file system and data processing
engine that is designed to handle extremely high volumes
of data in any structure.
• Hadoop has two components:
– The Hadoop distributed file system (HDFS), which supports data
in structured relational form, in unstructured form, and in any
form in between
– The MapReduce programming paradigm for managing applications
on multiple distributed servers
• Hive: a query language similar to SQL (HiveQL) but
compatible with Hadoop
Why use Hadoop?
Need to process a lot of data (petabyte scale).
Need to parallelize processing across a multitude of machines.
Gives scalability with low-cost commodity hardware.
Companies using Hadoop
New York Times
The Hadoop distributed file system (HDFS) is a distributed
file system designed to run on commodity hardware.
Each node in a Hadoop instance has a single DataNode.
It achieves reliability by replicating data across multiple
hosts (to handle hardware failure).
Data nodes can communicate with each other.
HDFS splits input data into blocks (64 MB or 128 MB).
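The storage arithmetic implied by block splitting and replication can be sketched as follows (a hypothetical 1 GB file, the 128 MB block size mentioned above, and HDFS's default replication factor of 3):

```python
import math

BLOCK_MB = 128     # HDFS block size from the slide above
REPLICATION = 3    # HDFS's default replication factor
file_mb = 1024     # hypothetical 1 GB input file

blocks = math.ceil(file_mb / BLOCK_MB)   # how many blocks the file is split into
replicas = blocks * REPLICATION          # block copies spread across the cluster
raw_mb = file_mb * REPLICATION           # raw disk consumed by replication
```

This is why a cluster's raw capacity must be roughly three times the logical data size under the default replication settings.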
MapReduce consists of a job tracker; the Job Tracker assigns
tasks to idle task tracker nodes in the cluster.
It applies a map function in parallel to every pair in the
input dataset and produces a list of pairs for each call.
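The map/shuffle/reduce flow described above can be simulated on a single machine; this sketch is not Hadoop code, just the classic word-count pattern in plain Python:

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the input record.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, values):
    # Reduce: combine all values that share a key.
    return (word, sum(values))

lines = ["big data", "big deal"]

# Map phase: map_fn is applied (in Hadoop, in parallel) to every record.
pairs = [pair for line in lines for pair in map_fn(line)]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: one reduce_fn call per distinct key.
counts = dict(reduce_fn(w, v) for w, v in groups.items())
```

Hadoop distributes the map and reduce calls across task tracker nodes and performs the shuffle over the network, but the data flow is exactly this.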
Hive is built on top of Hadoop to provide data
summarization, query, and analysis.
It provides an SQL-like language called HiveQL, which
supports SELECT, JOIN, GROUP BY, etc.
E.g. SELECT yearofpublication, COUNT(booktitle) FROM
bxdataset GROUP BY yearofpublication;
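The HiveQL example above is standard SQL aggregation, so its behavior can be demonstrated with SQLite; the three `bxdataset` rows here are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bxdataset (booktitle TEXT, yearofpublication INTEGER)")
con.executemany("INSERT INTO bxdataset VALUES (?, ?)",
                [("A", 2001), ("B", 2001), ("C", 2003)])

# Same shape as the HiveQL query from the slide.
rows = con.execute(
    "SELECT yearofpublication, COUNT(booktitle) "
    "FROM bxdataset GROUP BY yearofpublication"
).fetchall()

by_year = dict(rows)  # one count per distinct year
```

The difference is scale: Hive compiles this query into MapReduce jobs over HDFS instead of executing it on a single machine.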
Future of Big Data
In the future we’ll be able to store and process more data
than we can now.
Data will be stored in many more forms, not just in greater
volume.
Hadoop will continue to improve.
More things will get integrated, and that trend will continue.
More and more data will move out of silo systems and into
central systems that provide a variety of tools running on a
variety of datasets, essentially an 'enterprise data hub'.
The enterprises that will do best are those that best
leverage their technology.