Putting your Big Data management strategy on the right track
PUTTING YOUR BIG DATA
STRATEGY ON THE RIGHT TRACK
Big data brings a mix of technologies into organizations, and harnessing
those tools can be a challenge. But there are steps IT teams can take to
put their projects on the path to success.
BY JACK VAUGHAN
UNLOCKING THE BUSINESS BENEFITS IN BIG DATA
1 FINDING THE RIGHT TOOLS
2 DON’T COUNT OUT THE DATA WAREHOUSE
3 DATA BY ANY OTHER NAME
4 GROWING PAINS
Information of all types is engulfing
computer systems in many organi-
zations, complicating efforts to pull
valuable business insights out of
it through big data analytics initia-
tives. At the same time, a cavalcade
of new technologies has arrived to
help companies cope with the data
influx—but sorting through those
technologies is often an intimidating
task in itself.
In addition, IT managers must
assess whether Hadoop clusters,
NoSQL databases and other big data
management tools can fit comfortably
into existing systems architectures
or if architectural modifications
are needed to accommodate
them. The answer varies based on
factors such as planned uses,
organizational structures and
IT maturity.
And the burgeoning business-side
interest in extracting business value
and deriving competitive advan-
tages from vaults of big data means
that there isn’t a lot of time to make
those assessments and choose
between the available technology
options. In more and more compa-
nies, big data is viewed as a precious
resource that business leaders and
data scientists want to sift through
like prospectors looking for precious
metals.
This “big data gold rush” puts
added pressure on IT and data
management strategists to quickly
deliver systems that can handle the
growing amounts, and increasing
variety, of incoming data.
One of the biggest issues in plan-
ning a big data strategy is where to
put all the data for processing and
analysis. It wasn’t long ago that
transactional data was the primary
concern and that the options for
managing it boiled down to a hand-
ful of relational databases. Multi-
dimensional databases, columnar
software and other specialized ana-
lytical engines added some choices
for warehousing data from transac-
tion systems for analysis. Even so,
in many companies the big decision
was: enterprise data warehouse
(EDW) or collection of independent
data marts?
But things have changed. Collecting
and analyzing data from social
media sites, sensors, system logs
and other nontransactional sources
has become a priority for many
organizations. And big data
technologies that can support those
initiatives have proliferated to such
an extent that the number of different,
and disparate, options is dizzying.

SURGING VOLUMES OF STRUCTURED AND UNSTRUCTURED
DATA, WHAT WE’VE COME TO KNOW AS BIG DATA, ARE
PUTTING IT AND DATA MANAGEMENT TEAMS UNDER THE GUN.

Matthew Aslett, an enterprise
software analyst at research and
advisory company The 451 Group,
has depicted the plethora of data
storage and management choices
now available in the form of a Lon-
don Underground subway map,
arraying the available technologies
as stations along color-coded lines
representing different product cat-
egories. In addition to conventional
databases, a sampling of those cat-
egories includes Hadoop file system
implementations as well as schema-
less NoSQL databases and “NewS-
QL” hybrids that use SQL-based
relational data models but aim to
provide NoSQL-like levels of data
scalability. Heightening the potential
for buyer bewilderment even more,
some categories house technologies
of widely varying stripes. In particu-
lar, NoSQL is an umbrella term that
encompasses a diverse mix of graph
databases; document, column and
key-value stores; and other types of
repositories.
Initially, many big data applica-
tions were “greenfield” projects
that didn’t face some of the issues
of typical application development
initiatives, such as the need to inte-
grate with legacy systems or struc-
tured data sources. Often, technol-
ogy-savvy data analysts and other
business users took a first hack at
doing something with unstructured
or semi-structured data under the
radar of IT and business intelligence
managers, taking advantage of the
open source nature of Hadoop and
many NoSQL tools. But big data is
definitely on the corporate radar
now, and the drive to incorporate
non-transactional forms of data into
mainstream analytics processes is
making effective deployment and
management of big data systems by
IT teams a necessity.
There are some fundamental
steps that companies can take to
get started on harnessing big data
technologies and putting their proj-
ects on the path to success. Let’s
take a closer look at a few of them.
1
FINDING THE
RIGHT TOOLS
It’s still early in the big data adop-
tion cycle, and different companies
are trying out different technolo-
gies—sometimes with the same end
goal, as a look at available user case
studies shows:
■ NoSQL databases are being
used to analyze network failure
and degradation patterns, manage
digital assets and track and
correlate Web server log activity,
among other applications.
■ Hadoop systems are being
employed for uses such as
matching highway traffic patterns
with cell phone usage data,
evaluating consumer buying
behavior for more targeted ethnic
demographics and creating
new financial services products
based on real-time analysis of
customer activity.
■ NewSQL databases have been
tapped to support applications
that include automating real-time
pricing for air travel and
improving the scalability of
utility database systems.
■ Analytical databases have been
applied in initiatives such as
dissecting website user activity and
uncovering trends in GPS
information collected from taxis.
The key is to pick the right data-
base for the job at hand, in the same
way bettors at a race track try to
choose “the horse for the course,”
a phrase that refers to the ability
of some thoroughbreds to run bet-
ter on dirt or grass, or on a dry or
muddy track. But multiple database
horses might be required for differ-
ent courses within a big data envi-
ronment.
ThoughtWorks Inc., a Chicago-
based software development servic-
es company that also sells applica-
tion lifecycle management tools, has
created a hypothetical online retail
application framework to illustrate
the concept of polyglot persistence,
or using a variety of database tech-
nologies to handle different types of
data based on which technology is
the best fit in each individual case.
For example, a key-value NoSQL
data store might be best for manag-
ing website user-session data as
part of the retail framework, accord-
ing to the ThoughtWorks model. But
it envisions the use of four other fla-
vors of NoSQL databases for tasks
such as processing online shopping-
cart data, powering the site’s rec-
ommendation engine and storing
user activity logs.
And SQL-based relational data-
bases still have their place in this
new polyglot world. In the online
retail framework, relational tech-
nology is depicted as a good fit for
financial data that requires transac-
tional updates and is best served by
a tabular structure. Reporting also
could be the province of a relational
database with SQL interfaces at
the ready for exchanging data with
reporting tools.
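
The division of labor described above can be sketched in a few lines of Python. This is only an illustration, not the ThoughtWorks framework itself: the class name RetailStores, the payments table and its columns are hypothetical, a plain dictionary stands in for a key-value NoSQL store, and an in-memory SQLite database stands in for a relational system.

```python
import sqlite3


class RetailStores:
    """Hypothetical polyglot-persistence facade for an online retailer."""

    def __init__(self):
        # Stand-in for a key-value store: session ID -> session data.
        self.sessions = {}
        # Stand-in for a relational database holding financial data.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE payments ("
            "order_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
        )

    def save_session(self, session_id, data):
        # Session data needs only fast lookup by key, with no schema.
        self.sessions[session_id] = data

    def load_session(self, session_id):
        return self.sessions.get(session_id)

    def record_payments(self, payments):
        # Using the connection as a context manager commits the batch on
        # success and rolls it back if any insert fails (atomicity).
        with self.db:
            self.db.executemany(
                "INSERT INTO payments VALUES (?, ?)", payments
            )

    def total_revenue_cents(self):
        # Reporting-style query through the SQL interface.
        row = self.db.execute(
            "SELECT COALESCE(SUM(amount_cents), 0) FROM payments"
        ).fetchone()
        return row[0]
```

In this sketch a batch of payments that contains a duplicate order ID is rejected as a whole, which is the kind of transactional guarantee the relational side is kept for, while session data goes to the simpler store.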
Relational databases are efficient
at processing transactions, and
through their support for
characteristics such as transactional
atomicity and consistency, they offer
reliability and data recovery capabilities
that NoSQL technologies typically
can’t match. But relational software
often isn’t suited to text and other
unstructured forms of big data. And
it requires “a lot of maintenance on
the back end,” including the need
to carefully construct data schemas
and modify them when business
requirements change, said Pramod
Sadalage, a principal consultant at
ThoughtWorks. Those issues are
minimized with NoSQL and Hadoop
offerings.
“What we’re saying is, ‘Give the
things that belong to a certain task
to a certain database,’ ” Sadalage
said. “If you have, for example, a
[product] catalog, put it in a data-
base that is well suited for that—
then searches go faster.”
2
DON’T COUNT
OUT THE DATA
WAREHOUSE
Big data management projects
might be born because existing data
warehouse systems are beginning to
sag under the weight of the data that
is flooding into organizations. But
that doesn’t mean data warehouses
are all of a sudden obsolete—just
that the nature of warehousing
data is changing to make room for
big data. “Different styles of data
warehouse architecture have come
and gone over the years,” said Philip
Russom, data management research
director at The Data Warehousing
Institute (TDWI) in Renton, Wash.
“As we move to bigger volumes and
diversity of data, we have to again
evolve the data warehouse, just as
we have in the past.”
Hadoop-based big data systems
initially were viewed as potential
data warehouse killers, but that
sentiment has largely given way to
expectations of peaceful coexis-
tence. For example, 78% of 263 IT
professionals, business users and
consultants surveyed by TDWI in
November 2012 said they thought
Hadoop systems could be a useful
complement to their data warehous-
es for supporting advanced analyt-
ics applications. In addition, 41%
saw Hadoop as an effective staging
area for information on its way to a
data warehouse. Asked if Hadoop
clusters could fully replace an EDW,
more than half of the respondents
said no; just 4% said yes (see
FIGURE 1).
Russom thinks that using Hadoop
to stage data for loading into data
warehouses is a “beachhead” for
big data technologies in companies.
But the staging process itself is one
aspect of data warehousing that
has changed significantly in recent
years, he said. In many cases, raw
data is likely to pile up in Hadoop
systems and initially be analyzed
there. “In the old days, the data
staging area was pretty temporary,”
Russom said. “But it has evolved to
become a kind of archive.”
Even so, he doesn’t expect those
archives to exist in isolation, dis-
connected from data warehouses.
Some of the data will be moved
into EDWs, perhaps in the form of
aggregated analytics results, and the
two technologies increasingly are
being used in tandem, according to
Russom. “Hadoop-enabled analyt-
ics are sometimes deployed in silos,
but the trend is toward integrating
Hadoop and EDW data at analysis
time for maximal visibility into busi-
ness performance,” he wrote in a
report about the TDWI survey.

FIGURE 1: HADOOP VERSUS THE DATA WAREHOUSE

Question                                                  Yes    Maybe    No
Can the HDFS augment your enterprise data warehouse?      50%    47%      3%
Can the Hadoop Distributed File System replace your
enterprise data warehouse?                                4%     37%      59%

SOURCE: THE DATA WAREHOUSING INSTITUTE. BASED ON A SURVEY OF 263 IT PROFESSIONALS,
BUSINESS USERS AND CONSULTANTS CONDUCTED IN NOVEMBER 2012.

3
DATA BY ANY
OTHER NAME
Big data projects begun as
skunkworks or standalone undertakings
do run the risk of creating
information silos. To prevent that,
organizations should incorporate them
into an overall data management
strategy from the start, said Mark Beyer,
an analyst at Gartner Inc. in
Stamford, Conn. That means asking many
of the same questions IT teams ask
about conventional data as part
of data quality and governance
programs, he added. For example,
where did a particular set of big data
come from, how long must it be kept
and does it need to be remediated
before being used?
Beyer said applying proven data
management processes to pools
of big data is especially important
with information that comes from
external sources, including what he
described as “crowdsourced” data
collected from Facebook, Twitter
and other social networks. With
such data, “you don’t know if the
‘create case’ matches the use case,”
he said. Understanding the origins
of data and factors such as how fast
it changes is crucial to effective big
data management, he advised.
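
One way to make those governance questions concrete is to attach a small metadata record to every incoming data set. The following Python sketch is purely illustrative: the DatasetRecord class, its field names and the readiness rule are assumptions made for the example, not a scheme proposed by Beyer or Gartner.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DatasetRecord:
    """Hypothetical governance metadata for one big data set."""

    name: str
    source: str               # where did the data come from?
    external: bool            # collected outside the organization?
    collected_on: date
    retention_days: int       # how long must it be kept?
    needs_remediation: bool   # must it be cleaned before use?

    def expires_on(self):
        # Last day the data set falls under its retention period.
        return self.collected_on + timedelta(days=self.retention_days)

    def ready_for_analysis(self, today):
        # Assumed rule: usable only if already remediated and
        # still inside the retention window.
        return not self.needs_remediation and today <= self.expires_on()
```

A record for externally sourced social network data, for instance, would start with needs_remediation set to True and would be released for analysis only after someone has reviewed it.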
The bottom line, Beyer said, is
that “big data assets are no more
accurate than any other digital
information”—and often less so. As
a result, he warned IT managers to
get ready for a bumpy ride: “Big data
is an invader. Big data breaks things.
You don’t control it.” Asserting
control over the data once it’s in an
organization’s systems could mean
the difference between success and
failure in making effective use of the
information.
4
GROWING PAINS
It’s also important to recognize that
technologies such as Hadoop, its
associated MapReduce program-
ming model and NoSQL databases
aren’t automatic cure-alls for a com-
pany’s data management needs.
In addition to the data quality and
governance challenges, technical
complexities lurk around the corners
of big data environments.
For many companies, complex-
ity comes in the form of Java-based
development. Java is the program-
ming language of choice for Hadoop
and other big data technologies. But
even the large army of experienced
Java developers faces challenges
in working with Hadoop because it
doesn’t include native support for
SQL. As a result, developers can run
into difficulties in creating MapRe-
duce programs to distill Hadoop
data into subsets for processing on
different compute nodes in a cluster,
said Paul Dix, CEO and founder of
Errplane, a New York-based consul-
tancy and developer of application
monitoring software. “Most Java
developers face issues in how they
think about processing data into the
MapReduce paradigm,” said Dix,
who also is a member of the New
York Hadoop User Group. “They
have to learn how to write MapRe-
duce code to work with Hadoop;
they have to learn to structure the
problem correctly.”
Programming directly in MapRe-
duce isn’t the only path developers
can take. “There are a lot of ways to
do Hadoop without writing MapRe-
duce programs from scratch,” said
Paul Mackles, senior manager of
software architecture at software
vendor Adobe Systems Inc. in San
Jose, Calif. For example, Hive, an
open source Hadoop offshoot, offers
a table-based data model and a
SQL-like language that automati-
cally compiles queries into MapRe-
duce statements for analyzing data
in Hadoop systems. Apache Pig is a
separate platform with a high-level
language for creating highly
parallelized MapReduce programs. In
addition, software vendors such as
Cloudera Inc. are starting to offer their
own SQL query engines for Hadoop.

A SNAPSHOT OF THE BIG DATA TECHNOLOGY LANDSCAPE
IT architects building big data systems have a variety of technology
components at their disposal.
■ Distributions of the Hadoop file system and related MapReduce
programming model are offered by Cloudera, Hortonworks, MapR Technologies
and other vendors.
■ Hadoop is not an island: The open source software framework is supported
by a long list of supporting tools, including Hive, HBase, Pig, HCatalog and
ZooKeeper.
■ NoSQL database technology has grown into a flourishing market seemingly
overnight, populated with products such as CouchDB, Cassandra, MongoDB,
RavenDB, Redis, Riak, Neo4j and InfiniteGraph.
■ Hybrid mixes of relational and non-relational technologies are emerging.
Referred to as “NewSQL” databases, they include the likes of VoltDB,
NuoDB, ScaleBase and Drizzle.
■ Analytical databases based on a mix of relational, columnar and massively
parallel processing technology include Sybase IQ, Teradata Aster, IBM
Netezza, HP Vertica, Greenplum and ParAccel.

Mixing Java skills and SQL add-
ons doesn’t assure Hadoop suc-
cess, though. Converting queries to
MapReduce in Hive “works fairly
well, but it isn’t always a clean tran-
sition,” Dix said.
Hive queries often require tuning
to attain the best possible perfor-
mance, according to Mackles. Data
joins are “not its strong suit,” he said
during a presentation at TDWI’s
2013 BI Executive Summit in Las
Vegas. Working with MapReduce
typically incurs performance hits at
the start of query jobs and imposes
more processing overhead while
they’re running, he added.
Finding a good starting point for
a would-be Hadoop development
team can help build both skills and
confidence. One possible starter
project recommended by Dix: put-
ting Web server log files into a
Hadoop cluster and then applying
MapReduce to the data to find out,
say, average response times on
webpages or the number of page-
loading errors generated by a Web
application. “That’s the low-hanging
fruit,” he said.
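
The shape of such a starter job can be sketched without a cluster. The Python below imitates the map, shuffle and reduce phases in a single process; it is not Hadoop code. The three-field log format (path, HTTP status, response time in milliseconds), the function names and the choice to count any status of 400 or higher as a page-loading error are all assumptions made for the example.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (path, (response_ms, is_error)) for one log line."""
    path, status, response_ms = line.split()
    # Assumption: HTTP status 400 and above counts as a page-loading error.
    yield path, (int(response_ms), int(status) >= 400)


def reduce_path(path, values):
    """Reduce step: average response time and error count for one page."""
    times = [ms for ms, _ in values]
    errors = sum(1 for _, is_error in values if is_error)
    return path, sum(times) / len(times), errors


def run_job(lines):
    """Single-process stand-in for the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    # Each key's values go to exactly one reduce call, as in MapReduce.
    return [reduce_path(path, values) for path, values in sorted(grouped.items())]
```

Structuring the problem this way, as independent per-line map calls followed by per-key reduce calls, is the mental shift Dix describes; on a real cluster the grouping step would be handled by the framework rather than by a dictionary.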
Mackles listed a variety of new
and upgraded tools that are being
developed to help organizations
get over the big data hump. Those
include a second-generation version
of MapReduce called YARN; a
table and storage management utility
named HCatalog; and Hadoop
2.0, which is available in an alpha
release and is designed to make
real-time processing and querying of
Hadoop data more feasible, among
other improvements. “Hadoop has
been around long enough that I
think a lot of the shortcomings are
pretty well known,” Mackles said,
adding that Hadoop 2.0 addresses
many of the issues.
Those technologies and others
might well help the big data man-
agement and analytics cause, but
they further add to the vast and
growing forest of tools that IT, data
warehousing and data management
professionals need to navigate in
planning and managing deploy-
ments. It’s a challenge that likely will
be faced in more and more compa-
nies, though. In the TDWI survey,
only 10% of the respondents said
their organizations had Hadoop
systems in production use—but
another 51% said they expected to
be Hadoop users within three years.
The corporate spotlight will be on
the IT teams responsible for build-
ing scalable big data systems and
integrating them into existing data
warehousing and analytics environ-
ments. Finding the right technolo-
gies, and managing the process in a
way that gets the most out of them,
will help keep the glare of that light
from getting too hot.