Big Data
Introduction
Information is one of the most important resources available to companies; it drives the decisions that determine what a company will do the next day, the next month, and the next year. The core component of this resource is data: with a little data, a company has a little information with which to plan future operations, while the same company with very large amounts of data, or big data as it is known, can find trends far more accurately, become more efficient, increase productivity, and in turn be more profitable. What separates data from big data? What defining characteristics does big data have, how can such a massive resource be fully utilized, and why should businesses, especially smaller ones, bother with such an undertaking?
To understand what big data is, one must first look at what came before the big data revolution that some large companies are just now on the cusp of. Before the advent of big data, gathering data was fairly cost-prohibitive because storing large amounts of it was expensive, and because processing power fell well short of what most businesses work with today, what companies were trying to accomplish could take far longer than expected, or simply not be possible with the equipment and techniques of the time. As the first obstacle has become less burdensome, it has become easier to collect and store much larger amounts of data, which has allowed some companies to use old data for purposes beyond its original intent. When a business collects data, it is normally working toward a specific goal or trying to gain a particular understanding; once the meaning had been extracted from the data gathered, little else was done with it, and it was typically thrown away. With storage no longer so cost-prohibitive, companies like Google were able to reuse old data for other purposes and glean additional insight beyond what the initial analysis had revealed. This is the idea behind big data: what companies hope to gain is information beyond the explicit content of very large sets of data.
Key information
How is data any different from big data? At what point does the size of this raw information change how it is labeled? The question itself is misleading, because it is not just the size of the data but three defining characteristics that help identify what big data is. According to Gartner (Laney, 2001), the focus areas of data management relate to volume, variety, and velocity. Volume is the actual size of the data being stored; since data storage has become more efficient over time, the threshold where big data starts has shifted with better technology.
Even with all of the advances in storage architecture and data compression, the amount of data available continues to grow faster than individual companies can manage on their own. Cisco has estimated that by the end of 2015 there will be 4.8 zettabytes of internet traffic throughout the world ("Cisco global cloud," n.d.), and that by the end of 2020 this number will approach 50 zettabytes. The common terms many people use to categorize data, such as gigabytes, terabytes, and for some even petabytes, are becoming insufficient to describe the enormous amount of potential information that will be available, though big data can still be made of small pieces.
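The jump in scale behind these figures is easier to grasp numerically. A minimal sketch, using the decimal (SI) convention where each unit is a factor of 1,000 larger than the last and 1 ZB = 10^21 bytes, converts the Cisco figures cited above:

```python
# Decimal (SI) byte units: each step up is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)

# Cisco's forecast figures cited above.
traffic_2015 = to_bytes(4.8, "ZB")  # annual internet traffic, end of 2015
traffic_2020 = to_bytes(50, "ZB")   # projected, end of 2020

# Roughly a tenfold increase over five years.
print(traffic_2020 / traffic_2015)  # → ~10.4
```

A zettabyte is a billion terabytes, which is why the familiar consumer-scale units stop being useful at this scale.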
Large numbers of small pieces of information can also add up to big data, if there are enough of them. Twitter is a prime example: users of the site send out tweets, essentially text messages with a limit of one hundred and forty characters, broadcast to all of their followers or even to the general public if that is the intention. This information still falls under big data because of the sheer number of people doing the same thing all day, every day. Beyond having volume based on usage, it also illustrates the second characteristic of big data, variety. Variety refers to the various formats data can come in. Business data has normally been highly structured: contained in charts, listed out numerically, or organized by some other rigid means. With the rise in popularity of social media sites and the very unstructured format of their content, big data systems must be able to process and make sense of what people are tweeting, or what their status on Facebook might be, and incorporate that into predictable patterns of behavior. The third characteristic of big data is velocity, which covers not only the speed of the data but also the frequency with which it is updated: data must be gathered quickly, but also analyzed and put to use in an equally rapid fashion.
To summarize the three characteristics: volume is the size of the information, which is moving from terabytes to zettabytes; variety is how the content is structured, with an increasing emphasis on unstructured data; and velocity is how fast that information goes in and comes out, batch versus streaming. With data evolving along these three dimensions, the methods for handling it must evolve as well. Keeping the data in a relational database is no longer an option that can keep up with the characteristics that define big data, so there has to be a new approach to how this information is handled.
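The batch-versus-streaming distinction under velocity can be sketched in a few lines. This is an illustrative sketch, not any particular framework's API: a batch job needs the full data set before it can answer, while a streaming consumer keeps its answer up to date as each record arrives.

```python
def batch_average(readings):
    """Batch style: all data must be collected before the answer exists."""
    return sum(readings) / len(readings)

class StreamingAverage:
    """Streaming style: the answer is maintained incrementally per record."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, reading):
        self.count += 1
        self.total += reading
        return self.total / self.count  # a current answer after every record

readings = [3.0, 5.0, 10.0]
stream = StreamingAverage()
for r in readings:
    latest = stream.update(r)

print(batch_average(readings), latest)  # both → 6.0
```

Both styles reach the same number here; the difference is that the streaming version had a usable answer after every single record, which is what high-velocity data demands.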
With a better understanding of what makes up big data, how do companies use this vast and ever-growing resource to improve their business? Depending on the size and scope of the business, it is possible for a company to manage its own big data using various programs available in the marketplace; currently one of the most widely used is a product known as Hadoop, made by Apache. Hadoop is used by many large companies such as Amazon.com, Google, and Facebook.
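Hadoop's core programming model is MapReduce, and the canonical first example is a word count. The sketch below simulates the map, shuffle, and reduce phases in plain Python to show the model that Hadoop distributes across a cluster; it is an illustration of the idea, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"], counts["is"])  # → 2 2 2
```

On a real cluster the map and reduce functions run in parallel on many machines over different slices of the input, which is what lets the same simple model scale to very large data sets.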
But what about smaller companies, for which maintaining large inventories of equipment to run this software is simply not feasible? Several services are available from providers like Amazon.com, IBM has similar offerings, and there are plenty of online marketplaces where a smaller company, without the capital to invest in all of the equipment and the specialists to run and maintain such systems, can still take advantage of some of the benefits of big data.
So why should a company take the step toward big data? After all, it makes sense for large establishments that already have significant capital invested in computers used for various reasons, but perhaps not, at first glance, for businesses like hospitals that are focused on taking care of patients. In fact, making use of big data is important in many different fields of work: not just medicine, but agriculture, where it can be used to make more accurate predictions about things like crop health and weather patterns. Weather models already use many big data techniques to develop the most comprehensive forecast available to the consumer. One notable aspect of big data is that experts in a given field may eventually be needed less, because their way of thinking can be applied to a computer ("The big data," 2013).
Even with all of this hype, and the very clear indication that this is a huge movement in computing that is just taking off, is there any reason a company should not bother with it for the time being? The answer is yes, sometimes: depending on the size of the organization, there may simply not be resources available to dedicate to it, even if all of that money were being spent on services through one of the hosted providers. It is a decision that must not be made lightly; a company should make sure it is aware of all of its options and of just how beneficial a big data asset could be for its kind of business.
Big data is also being processed and analyzed outside the business world, within the scientific community. Certain scientific projects generate very large amounts of data very quickly, which must be recorded and then studied to find every piece of relevant information the research can yield. No project has had as much attention focused on it as the Large Hadron Collider (LHC), located at CERN in Switzerland. A wonderful aspect of the ongoing projects there is the media attention they receive from scientists and non-scientists alike, and whether people are aware of it or not, big data is very much involved. A striking aspect of the information gathered at CERN is how many sensors are available and how much data could be gathered if the scientists wanted it all: the LHC has one hundred and fifty million sensors that deliver data forty million times per second, which could produce almost 500 exabytes of data per day (LHC Brochure, 2013), a much larger amount than the world currently produces on a daily basis.
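The brochure's figure can be sanity-checked with back-of-the-envelope arithmetic. The per-reading size below is an assumption made purely for illustration (roughly one byte per sensor reading), not a CERN specification:

```python
sensors = 150_000_000             # sensors on the LHC detectors
readings_per_second = 40_000_000  # deliveries per sensor per second
seconds_per_day = 86_400
bytes_per_reading = 1             # assumed for illustration only

bytes_per_day = sensors * readings_per_second * seconds_per_day * bytes_per_reading
exabytes_per_day = bytes_per_day / 10**18  # 1 EB = 10^18 bytes

print(round(exabytes_per_day))  # → 518, on the order of the ~500 EB cited
```

Even at a single byte per reading, the raw rate lands in the hundreds of exabytes per day, which is why the experiments filter out the overwhelming majority of events before anything is stored.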
Another project within the realm of science that works with very large sets of data for the advancement of research is Folding@home, started at Stanford University. The idea behind Folding@home is based on research into protein structures, unfolding them through intensive mathematical number crunching. This goal is achieved by volunteers who donate the processing power of their home computers: each machine downloads a segment of work, keeps running until it is complete, sends the finished result back, and then automatically downloads a new work unit and starts the cycle again. This is an example of a university taking a very nontraditional approach to tackling a very large data set, making a game of it among computer enthusiasts to help contribute to research that might not otherwise have been possible.
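The volunteer cycle described above, fetch a work unit, process it locally, return the result, and repeat, can be sketched as a simple client loop. All of the function names and the work-unit format here are hypothetical stand-ins used only to illustrate the cycle; they are not the real Folding@home client or protocol.

```python
import random

def fetch_work_unit():
    """Hypothetical stand-in for downloading a work unit from the project server."""
    return [random.random() for _ in range(1000)]

def process(work_unit):
    """Stand-in for the real protein-folding computation on one unit."""
    return sum(work_unit) / len(work_unit)

def upload_result(result):
    """Stand-in for sending the completed unit back to the server."""
    return True

def client_loop(units=3):
    """One volunteer machine: fetch, compute, return, then start the next unit."""
    completed = 0
    for _ in range(units):
        unit = fetch_work_unit()
        result = process(unit)
        if upload_result(result):
            completed += 1
    return completed

print(client_loop())  # → 3 units completed
```

The big data set never sits on any one machine; it is carved into independent units, which is what makes donated home computers a workable substitute for a cluster.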
Another scientific machine that used big data is IBM's Watson. Normally computers or robots are created to accomplish very specialized tasks; sometimes they need an operator, and other times they can be automated to function in a limited capacity on their own. Watson was created for a very specific task, winning at Jeopardy, just as an earlier IBM machine, Deep Blue, had been designed for the sole purpose of playing chess. A key difference between Jeopardy and chess is that in chess the data is very structured: the squares are all labeled, the pieces are all labeled, and each piece has its own individual value. Jeopardy, on the other hand, is highly unstructured: an answer is presented and the correct question must be given to earn any points, and spoken language contains words and phrases that can have more than one meaning depending on how the answer is phrased. In the end, Watson was built with some of the technology that makes processing big data possible, played an amazing game of Jeopardy, and won for IBM.
Personal observations
“Volume is Big Data's greatest challenge as well as its greatest opportunity.” (Barnatt, 2012)
I believe this is a very powerful statement that sums up one of the biggest hurdles yet to be overcome in this field. What I personally gather from it is that housing the sheer volume of big data we have now, and will have in the future, is something we will collectively have to work together on. As discussed in the key information, the amount of data we have today is not the concern; the issue comes when we have nearly 5 zettabytes of data in just a few short years and then jump to almost ten times that number five years later. This is a huge hurdle to overcome, but if we can, the companies that collect these very large amounts of information and use them competitively will have an enormous advantage. My main reason for believing this is that data has become a resource many consumers spend in place of money, which suits most companies because it allows the collection of this very precious data. If a company were able to get this whole process going very early on, it might gain such a competitive advantage that we would not see much in the way of close competitors for a while.
One of the pitfalls I think may be an issue is getting too wrapped up in the data and missing things that are occurring within the organization. This is something I have seen from time to time at my own place of employment: you spend much of your time looking at the numbers, running what-if analyses of the metrics in place against newly proposed metrics to increase productivity. When allowed to crunch those numbers, you can arrive at what appears to be a perfect solution, one that seems to have minimal impact on the frontline agents who are graded on those metrics and appears to advance some goal. But when a change like this goes through, there is a breakdown at the individual level: people now have to devote more time to the various metrics that have changed, even those who had done fine in the past. Maybe this is because we are not yet able to look at all of the data available, and it is possible our techniques for rolling out mass changes to frontline employees will improve. But a serious oversight, at least for most companies in the marketplace today, is that change is based on averages and not considered at the individual level. That is where, if it can be done, I hope big data is able to make a large impact on the work environment, because I do not believe any company likes high attrition.
The last observation I have is the market dominance Hadoop appears to have within the big data market. Not only are the big companies using Hadoop, but the companies offering their own cloud-based services are using Hadoop as the platform on which those services run, from Amazon's cloud service to IBM and others in the retail world. On one hand I like this, because the software is open source, something I truly admire in companies that can build something like this and turn it over for free; on the other hand, the idea is a bit disconcerting, because it does not create an arena for any kind of competition. I enjoy the competition that a rival company causes; to me it leads to innovation, which means more cool things, and then more innovation again.
Summary
Obviously the world of big data is very large and often more than a little confusing. There are many applications that I believe will be enhanced by its adoption and use on a much larger scale. These improvements will not only be in areas of business, as discussed, but within the scientific community and other groups that will need to examine the ever-increasing amount of data that will be available in the very near future. Some of the numbers mentioned were not even figures I was aware we would be working with anytime soon.
For additional information on big data, and as a topic suggestion, some aspects of Hadoop, particularly the framework for how it processes its information, would not only be very interesting but would also give insight into a field that may need many more talented individuals to fill out its ranks.
References
Laney, D. (2001, February 6). 3D data management: Controlling data volume, velocity, and variety [Web log message]. Retrieved from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

Cisco global cloud index: Forecast and methodology, 2011–2016. (n.d.). Retrieved from http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns1175/Cloud_Index_White_Paper.pdf

The big data revolution. (2013). [Web video]. Retrieved from http://www.youtube.com/watch?v=5ZyQ04zzyoE

Barnatt, C. (2012, October 9). Big data. Retrieved from http://www.explainingcomputers.com/big_data.html

LHC brochure, English version: A presentation of the largest and the most powerful particle accelerator in the world, the Large Hadron Collider (LHC), which started up in 2008. Its role, characteristics, technologies, etc. are explained for the general public (CERN-Brochure-2010-006-Eng). CERN. Retrieved January 20, 2013.