Game-Changing Architectural Advances Take Data Analytics to New Performance Heights
Transcript of a BriefingsDirect podcast on how new architectural advances in collocating applications with data provide analytics performance breakthroughs.
Listen to the podcast. Find it on iTunes/iPod and Podcast.com. Download the transcript. Learn
more. Sponsor: Aster Data Systems.
Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're
listening to BrieﬁngsDirect.
Today, we present a sponsored podcast discussion on how new architectures for data and logic
processing are ushering in a game-changing era of advanced analytics.
These new approaches support massive data sets to produce powerful insights and
analysis, yet with unprecedented price-performance. As we enter 2010, enterprises
are including more forms of diverse data into their business intelligence (BI)
activities. They're also diversifying the types of analysis that they expect from these systems.
We're also seeing more kinds and sizes of companies and government agencies seeking to deliver
ever more data-driven analysis for their employees, partners, users, and citizens. It boils down to
giving more communities of participants what they need to excel at whatever they're doing. By
putting analytics into the hands of more decision makers, huge productivity wins across entire
economies become far more likely.
But such improvements won’t happen if the data can't effectively reach the application's logic, if
the systems can't handle the massive processing scale involved, or if the total costs and complexity
are too high.
In this discussion we examine how convergence of data and logic, of parallelism and MapReduce
-- and of a hunger for precise analysis with a ﬂood of raw new data -- all are setting the stage for
powerful advanced analytics outcomes.
Here to help us learn how to attain advanced analytics and to uncover the benefits from these
new architectural activities for ubiquitous BI is Jim Kobielus, senior analyst at Forrester
Research. Welcome, Jim.
Jim Kobielus: Hi, Dana. Hi, everybody.
Gardner: We're also joined by Sharmila Mulligan, executive vice president of marketing at
Aster Data. Welcome, Sharmila.
Sharmila Mulligan: Thank you. Hello, everyone.
Gardner: Jim, let me start with you. We're looking at a shift now, as I have mentioned, in
response to oceans of data and the need for analysis across different types of applications and
activities. What needs to change? The demands are there, but what needs to change in terms of
how we provide the solution around these advanced analytical undertakings?
Kobielus: First, Dana, we need to rethink the platforms with which we're doing analytical
processing. Data mining is traditionally thought of as being the core of advanced
analytics. Generally, you pull data from various sources into an analytical data mart.
That analytical data mart is usually on a database that's specific to a given
predictive modeling project, let's say a customer analytics project. It may be a
very fast server with a lot of compute power for a single server, but quite often
what we call the analytical data mart is not the highest performance database you have in your
company. Usually, that high performance database is your data warehouse.
As you build larger and more complex predictive models -- and you have a broad range of
models and a broad range of statisticians and others building, scoring, and preparing data for
these models -- you quickly run into resource constraints on your existing data-mining platform.
So, you have to look for where you can find the CPU power, the data storage, and the I/O
bandwidth to scale up your predictive modeling efforts. That's the number one thing. The data
warehouse is the likely suspect.
Also, you need to think about the fact that these oceans of data need to be prepared, transformed,
cleansed, meshed, merged, and so forth before they can be brought into your analytical data mart
for data mining and the like.
Quite frankly, the people who do predictive modeling are not specialists at data preparation.
They have to learn it and they sometimes get very good at it, but they have to spend a lot of time
on data mining projects, involved in the grunt work of getting data in the right format just to
begin to develop the models.
As you start to rethink your whole advanced analytics environment, you have to think through
how you can automate to a greater degree all these data preparation, data loading chores, so that
the advanced analytics specialists can do what they're supposed to do, which is build and tune
models of various problem spaces. Those are key challenges that we face.
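The data-preparation chores described above can be sketched in outline. This is an illustrative sketch only; the function and field names (cleanse, transform, spend_per_visit) are hypothetical, not any vendor's API:

```python
# Illustrative sketch of automating data-preparation chores before modeling.
# All names here are hypothetical; no specific product's API is implied.

def cleanse(record):
    """Normalize a raw record: trim stray whitespace from string fields."""
    return {k: (v.strip() if isinstance(v, str) else v) for k, v in record.items()}

def transform(record):
    """Derive the fields a model expects, e.g. spend per visit."""
    visits = record.get("visits") or 1  # guard against missing/zero visits
    record["spend_per_visit"] = record.get("spend", 0) / visits
    return record

def prepare(records):
    """Compose the preparation steps so modelers start from clean input."""
    return [transform(cleanse(r)) for r in records]

raw = [{"name": " alice ", "spend": 120.0, "visits": 4}]
prepared = prepare(raw)
```

The point of composing the steps into one `prepare` call is that statisticians invoke a single, repeatable pipeline instead of redoing the grunt work by hand for each modeling project.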
But there is a third challenge: advanced analytics produces predictive models.
Those predictive models increasingly are deployed inline with transactional applications, like your
call center, to provide some basic logic and rules that will drive such important functions as the "next
best offer" being made to customers, based on a broad variety of historical and current data.
How do you inject predictive logic into your transactional applications in a fairly seamless way?
You have to think through that, because, right now, quite often analytical data models, predictive
models, in many ways are not built for optimal embedding within your transactional application.
You have to think through how to converge all these analytical models with the transactional
logic that drives your business.
Gardner: Okay. Sharmila, are your users or the people that you talk to in the market aware that
this shift is under way? Do they recognize that the same old way of doing things is not going to
sustain them going forward?
New data platform
Mulligan: What we see with customers is that the advanced analytics needs and the new
generation of analytics that they are trying to do are driving the need for a new data platform.
Previously, the choice of a data management platform was based primarily on
price-performance, being able to effectively store lots of data, and get very
good performance out of those systems. What we're seeing right now is that,
although price performance continues to be a critical factor, it's not necessarily
the only factor or the primary thing driving their need for a new platform.
What's driving the need now, and one of the most important criteria in the selection process, is
the ability of this new platform to be able to support very advanced analytics.
Customers are very precise in terms of the type of analytics that they want to do. So, it's not that
a vendor needs to tell them what they are missing. They are very clear on the type of data
analysis they want to do, the granularity of data analysis, the volume of data that they want to be
able to analyze, and the speed that they expect when they analyze that data.
They are very clear on what their requirements are, and those requirements are coming from the
top. Those new requirements, as it relates to data analysis and advanced analytics, are driving the
selection process for a new data management platform.
There is a big shift in the market, where customers have realized that their preexisting platforms
are not necessarily suitable for the new generation of analytics that they're trying to do.
Gardner: Let's take a pause and see if we can't deﬁne these advanced analytics a little better.
Jim, what do we mean nowadays when we say "advanced analytics?"
Kobielus: Different people have their deﬁnitions, but I'll give you Forrester's deﬁnition, because
I'm with Forrester. And it makes sense to break it down into basic analytics versus advanced analytics.
What is basic analytics? Well, that's BI. It's the core of BI that you build your decision support
environment on. That's reporting, query, online analytical processing, dashboarding, and so forth.
It's fairly clear what's in the core scope of BI.
Traditional basic analytics is all about analytics against deep historical datasets and being able to
answer questions about the past, including the past up to the last ﬁve seconds. It's the past that's
the core focus of basic analytics.
What's likely to happen
Advanced analytics is focused on answering questions about the future. It's what's likely to
happen -- forecast, trend, what-if analysis -- as well as what I like to call
the deep present: real-time streams for complex event processing.
What's streaming in now, and how can you analyze the great gushing
streams of information that are emanating from all your applications, your
workﬂows, and from social networks?
Advanced analytics is all about answering future-oriented, proactive, or predictive questions, as
well as current streaming, real-time questions about what's going on now. Advanced analytics
leverages the same core features that you ﬁnd in basic analytics -- all the reports, visualizations,
and dashboarding -- but then takes it several steps further.
First and foremost, it's all about amassing a data warehouse or a data mart full of structured and
unstructured information and being able to do both data mining against the structured
information, and text analytics or content analytics against the unstructured content.
Then, in the unstructured content, it's being able to do some important things, like natural
language processing to look for entities and relationships and sentiments and the voice of the
customer, so you can then extrapolate or predict what might happen in the future. What might
happen if you make a given offer to a given customer at a given time? How are they likely to
respond? Are they likely to jump to the competition? Are they likely to purchase whatever you're
offering? All those kinds of questions.
Gardner: Sharmila, do you have anything to offer further on defining advanced analytics in this context?
Mulligan: Before I go into advanced analytics, I'd like to add to what Jim just talked about on
basic analytics. The query and reporting aspect continues to be very important, but the difference
now is that the size of the data set is far larger than what the customer has been running with before.
What you've got is a situation where they want to be able to do more scalable reporting on
massive data sets with very, very fast response times. On the reporting side, in terms of the end
result to the customer, it is similar to the type of report they are trying to achieve, but the
difference is that the quantity of data that they're trying to get at, and the amount of data that
these reports draw on is far greater than what they had before.
That's what's driving a need for a new platform underneath some of the preexisting BI tools that
are, in themselves, good at reporting, but what the BI tools need is a data platform beneath them
that allows them to do more scalable reporting than you could do before.
Kobielus: I just want to underline that, Sharmila. What Forrester is seeing is that, although the
average data warehouse today is in the 1-10 terabyte range for most companies, we foresee the
average warehouse size going, in the middle of the coming decade, into the hundreds of terabytes.
In 10 years or so, we think it's possible, and increasingly likely, that petabyte-scale data
warehouses or content warehouses will become common. It's all about unstructured information,
deep history, and historical information. A lot of trends are pushing enterprises in that direction.
Managing big data
Mulligan: Absolutely. That is obviously the big topic here, which is, how do you manage big
data? And, big data could be structured or it could be unstructured. How do you assimilate all
this in one platform and then be able to run advanced analytics on this very big data set?
Going back to what Jim discussed on advanced analytics, we see two big themes. One is
the real-time nature of what our customers want to do. There are particular use cases, where what
they need is to be able to analyze this data in near real-time, because that's critical to being able
to get the insights that they're looking for.
Fraud analytics is a good example of that. Customers have been able to do fraud analytics, but
they're running fraud checks after the fact and discovering where fraud took place after the event
has happened. Then, they have to go back and recover from that situation. Now, what customers
want is to be able to run fraud analytics in near real-time, so they can catch fraud while it's happening.
What you see is everything from cases in ﬁnancial services companies related to product fraud,
as well as, for example, online gaming sites, where users of the system are collaborating on the
site and trying to commit fraud. Those types of scenarios demand a system that can return the
fraud-analysis data in near real-time, so it can block these users from conducting fraud while it's happening.
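One simple ingredient of such a near-real-time check is a sliding time window over each account's activity. The sketch below is purely illustrative (the threshold, window length, and function names are invented for the example, not drawn from any fraud product):

```python
# Hypothetical sketch of a near-real-time fraud check: flag an account
# that exceeds a transaction-count threshold inside a sliding time window.
# Thresholds and names are illustrative only.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_TXNS_PER_WINDOW = 5

recent = defaultdict(deque)  # account -> timestamps of recent transactions

def is_suspicious(account, timestamp):
    """Record a transaction; return True if it pushes the account over the limit."""
    window = recent[account]
    window.append(timestamp)
    # Evict transactions that have aged out of the window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_TXNS_PER_WINDOW
```

Because the check runs as each event arrives, a suspicious burst can be blocked while it is happening rather than discovered in an after-the-fact batch report.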
The other big thing we see is the predictive nature of what customers are trying to do. Jim talked
about predictive analytics and modeling analytics. Again, that's a big area that we see massive
new opportunity and a lot of new demand. What customers are trying to do there is look at their
own customer base to be able to analyze data, so that they can predict trends in the future.
For example, what are the buying trends going to be, let's say at Christmas, for consumers who
live in a certain area? There is a lot around behavior analysis. In the telco space, we see a lot of
deep analysis around trying to model behavior of customers on voice usage of their mobile
devices versus data usage.
By understanding some of these patterns and the behavior of the users in more depth, these
organizations are now able to better service their customers and offer them new product
offerings, new packages, and a higher level of personalization, by understanding the behavior of
their customers in more depth.
Predictive analytics is a term that's existed for a while, and is something that customers have
been doing, but it's really reaching new levels in terms of the amount of data that they're trying to
analyze for predictive analytics, and in the granularity of the analytics itself in being able to
deliver deeper predictive insight and models.
As I said, the other big theme we see is the push toward analysis that's much nearer to real-time
than what they were able to do before. This is not a trivial thing to do when it comes to very
large data sets, because what you are asking for is the ability to get very, very quick response
times and incredibly high performance on terabytes and terabytes of data to be able to get these
kind of results in real-time.
Gardner: Jim, these examples that Sharmila has shared aren't just rounding errors. This isn't a
movement toward higher efﬁciency. These are game changers. These are going to make or break
your business. This is going to allow you to adjust to a changing economy and to shifting
preferences by your customers. We're talking about business fundamentals here.
Social network analysis
Kobielus: We certainly are. Sharmila was discussing behavioral analysis, for example, and
talking about carrier services. Let's look at what's going to be a true game changer, not just for
business, but for the global society. It's a thing called social network analysis.
It's predictive models, fundamentally, but it's predictive models that are applied to analyzing the
behaviors of networks of people on the web, on the Internet, Facebook, and Twitter, in your
company, and in various social network groupings, to determine classiﬁcation and clustering of
people around common afﬁnities, buying patterns, interests, and so forth.
As social networks weave their way into not just our consumer lives, but our work lives, our life
lives, social network analysis -- leveraging all the core advanced analytics of data mining and
text analytics -- will take the place of the focus group. In an online world, everything is virtual.
As a company, you're not going to be able, in any meaningful way, to bring together your users
into a single room and ask them what they want you to do or provide for them.
What you're going to do, though, is listen to them. You're going to listen to all their tweets and
their Facebook updates and you're going to look at their interactions online through your portal
and your call center. Then, you're going to take all that huge stream of event information -- we're
talking about complex event processing (CEP) -- you're going to bring it into your data
warehousing grid or cloud.
You're also going to bring historical information on those customers and their needs. You're
going to apply various social network behavioral analytics models to it to cluster people into the
categories that make us all kind of squirm when we hear them, things like yuppie and Generation
X and so forth. Professionals in the behavioral or marketing world are very good at creating
segmentation of customers, based on a broad range of patterns.
Social network analysis becomes more powerful as you bring more history into it -- last year,
two years, five years, 10 years' worth of interactions -- to get a sense for how people will likely
respond to new offers, bundles, packages, campaigns, and programs that are thrown at
them through social networks.
It comes down to things like Sharmila was getting at, simple things in marketing and sales, such
as a Hollywood studio determining how a movie is being perceived by the marketplace, by
people who go out to the theater and then come out and start tweeting, or even tweeting while
they are in the theater -- "Oh, this movie is terrible" or "This movie rocks."
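The tweet-scoring idea can be caricatured with a toy word-list model. This is a deliberately simplified sketch (real text analytics uses natural language processing, not a hand-built lexicon; the word lists here are invented for illustration):

```python
# Toy sentiment scorer over short texts like tweets. The lexicons are
# illustrative stand-ins for a real NLP sentiment model.
POSITIVE = {"rocks", "great", "love"}
NEGATIVE = {"terrible", "hate", "awful"}

def sentiment(text):
    """Score a text: positive-word hits minus negative-word hits."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)
```

Aggregating scores like this over a live stream of tweets is the simplest version of sensing, in near real-time, how a product is being perceived.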
They can get a sense of how a product or service is being perceived in real-time, so that the
provider of that product or service can then turn around and tweak that marketing campaign, the
pricing, and incentives in real-time to maximize the yield, the revenue, or proﬁt of that event or
product. That is seriously powerful and that's what big data architectures allow you to do.
If you can push not just the analytic models, but to some degree bring transactional applications,
such as workﬂow, into this environment to be triggered by all of the data being developed or
being sifted by these models, that is very powerful.
Gardner: We know that things are shifting and changing. We know that we want to get access to
the data and analytics. And, we know what powerful things those analytics can do for us. Now,
we need to look at how we get there and what's in place that prevents us.
Let's look at this architecture. I'm looking into MapReduce more and more. I am even hearing
that people are starting to write MapReduce into their requests for proposals (RFPs), as they're
looking to expand and improve their situation. Sharmila, what's wrong with the current
environment and why do we need to move into something a bit different?
Moving the data
Mulligan: One of the biggest issues that the preexisting data pipeline faces is that the data lives
in a repository that's removed from where the analytics take place. Today, with the existing
solutions, you need to move terabytes and terabytes of data through the data pipeline to the
analytics application, before you can do your analysis.
There's a fundamental issue here. You can't move boulders and boulders of data to an application.
It's too slow, it's too cumbersome, and you're not factoring in all your fresh data in your analysis,
because of the latency involved.
One of the biggest shifts is that we need to bring the analytics logic close to the data itself.
Having it live in a completely different tier, separate from where the data lives, is problematic.
This is not a price-performance issue in itself. It is a massive architectural shift that requires
bringing analytics logic to the data itself, so that data is collocated with the analytics itself.
MapReduce, which you brought up earlier, plays a critical role in this. It is a very powerful
technology for advanced analytics and it brings capabilities like parallelization to an application,
which then allows for very high-performance scalability.
What we see in the market these days are terms like "in-database analytics," "applications inside
data," and all this is really talking about the same thing. It's the notion of bringing analytics logic
to the data itself.
I'll let Jim add a lot more to that since he has developed a lot of expertise in this area.
Gardner: Jim, are we in a perfect world here, where we can take the existing BI applications
and apply them to this new architecture of joining logic and data in proximity, or do we have to
come up with whole new applications in order to enjoy this architectural beneﬁt?
Kobielus: Let me articulate in a little bit more detail what MapReduce is and is not. MapReduce
is, among other things, a set of extensions to SQL -- SQL/MapReduce (SQL/MR). So, you can
build advanced analytic logic using SQL/MR that can essentially do the data prep, the data
transformations, the regression analyses, the scoring, and so forth, against both structured data in
your relational databases and unstructured data, such as content that you may source from RSS
feeds and the like.
To the extent that we always, or for a very long time, have been programming database
applications and accessing the data through standard SQL, SQL/MR isn't radically different from
how BI applications have traditionally been written.
But, these are extensions and they are extensions that are geared towards enabling maximum
parallelization of these analytic processes, so that these processes can then be pushed out and be
executed, not just in databases, but in file systems, such as the Hadoop Distributed File System,
or in cloud data warehouses.
MapReduce, as a programming model and as a language, is in many ways agnostic as to the
underlying analytic database, file system, or cloud environment where the information as a
whole lives and how it's processed.
But no, you can't take your existing BI applications, in terms of the reporting, query,
dashboarding, and the like, transparently move them, and use MapReduce without a whole lot of
rewriting of these applications.
You can't just port your existing BI applications to MapReduce and database analytics. You're
going to have to do some conversions, and you're going to have to rewrite your applications to
take advantage of the parallelism that SQL/MR enables.
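The parallelism SQL/MR exposes rests on the bare map/reduce pattern itself, which can be shown in a few lines. This is a generic sketch of the pattern, not Aster's SQL/MR syntax:

```python
# Generic map/reduce sketch (not SQL/MR syntax): count word frequencies
# by mapping each document to (word, 1) pairs, then reducing by key.
# Each map call is independent, which is what lets a platform run the
# map phase in parallel across many nodes.
from collections import Counter
from itertools import chain

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Sum the counts for each word across all documents."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data analytics", "big data warehouse"]
result = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
```

The rewriting Jim describes amounts to recasting analytic logic into this independent-map, merge-at-the-end shape so the platform can fan it out.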
MapReduce, in many ways, is geared not so much for basic analytics. It's geared for advanced
analytics. It's data mining and text mining. In many ways, MapReduce is the ﬁrst open
framework that the industry has ever had for programming the logic for both data mining and
text mining in a seamless way, so that those two types of advanced analytic applications can live
and breathe and access a common pool of complex data.
MapReduce is an open standard that Aster clearly supports, as do a number of other database and
data warehousing vendors. In the coming year and the coming decade, MapReduce and Hadoop
-- and I won't go to town on what Hadoop is -- will become fairly ubiquitous within the analytics
arena. And, that’s a good thing.
So, any advanced analytic logic that you build in one tool, in theory, you can deploy and have it
optimized for execution in any MapReduce-enabled platform. That’s the promise. It’s not there
yet. There are a lot of glitches, but that’s the strong promise.
Mulligan: I'd like to add a little bit to that, Dana. In the marriage of SQL with MapReduce, the
real intent is to bring the power of MapReduce to the enterprise, so that SQL programmers can
now use that technology. MapReduce alone does require some sophistication in terms of
programming skills to be able to utilize it. You may typically ﬁnd that skill set in Web 2.0
companies, but often you don’t ﬁnd developers who can work with that in the enterprise.
What you do ﬁnd in enterprise organizations is that there are people who are very proﬁcient at
SQL. By bringing SQL together with MapReduce what enterprise organizations have is the
familiarity of SQL and the ease of using SQL, but with the power of MapReduce analytics
underneath that. So, it’s really letting SQL programmers leverage skills they already have, but to
be able to use MapReduce for analytics.
Over time, of course, it’s possible that there will be more expertise developed within enterprise
organizations to use MapReduce natively, but at this time and, we think, in the next couple of
years, the SQL/MapReduce marriage is going to be very important to help bring MapReduce
across the enterprise.
Hadoop, itself, obviously is an interesting platform too in being able to store lots of data cost
effectively. However, often customers will also want some of the other characteristics of a data
warehouse, like workload management, failover, backup recovery, etc., that the technology may
not necessarily provide.
MapReduce, now available within the new generation of massively parallel processing (MPP)
data warehouses, brings the best of both worlds. It
brings what companies need in terms of the enterprise data warehouse capabilities. It lets you put
application logic near data, as we talked about earlier. And, it brings MapReduce, but through the
SQL/MapReduce framework, which really primarily is designed to ease adoption and use of
MapReduce within the enterprise.
Gardner: Jim, we are on a journey. It’s going to be several years before we are getting to where
we want to go, but there is more maturity in some areas than others. And, there is an opportunity
to take technologies that are available now and produce some really strong business outcomes.
Give me a sense of where you see the maturity of the architecture, of the SQL, and of the tools
that make these technologies converge. Who is mature? How is this shaking out?
Kobielus: One measure of maturity is best practice -- in this case, in-database analytics. As I said, it's widely
supported through proprietary approaches by many vendors.
Judged by adoption of an open industry framework with cross-vendor
interoperability, though, MapReduce and Hadoop are not mature yet. There are
pioneering vendors like Aster, but there are a signiﬁcant number of established big data
warehousing vendors that have varying degrees of support now or in the near future for these
frameworks. We're seeing strong indications. In fact, Teradata already is rolling out MapReduce
and Hadoop support in their data warehousing offerings.
We're not yet seeing a big push from Oracle, or from Microsoft for that matter, in the direction of
support for MapReduce or Hadoop, but we at Forrester believe that both of those vendors, in
particular, will come around in 2010 with greater support.
IBM has made signiﬁcant progress with its support for Hadoop and MapReduce, but it hasn’t yet
been fully integrated into that particular vendor's platform.
Looking to 2010, 2011
If we look at a broad range of other data warehousing vendors like Sybase, Greenplum, and
others, most vendors have it on their roadmap. To some degree, various vendors have these
frameworks in development right now. I think 2010 and 2011 are the years when most of the
data warehousing and also data mining vendors will begin to provide mature, interoperable
implementations of these standards.
There is a growing realization in the industry that advanced analytics is more than just being able
to mine information at rest, which is what MapReduce and Hadoop are geared to doing. You also
need to be able to mine and do predictive analytics against data in motion. That’s CEP.
MapReduce and Hadoop are not really geared to CEP applications of predictive modeling.
There needs to be, and there will be over the next five years or so, a push in the industry to
embed MapReduce and Hadoop there as well. A few vendors are showing some progress toward
CEP predictive modeling, but it's not widely supported yet, and only through proprietary approaches.
In this coming decade, we're going to see predictive logic deployed into all application
environments, be they databases, clouds, distributed ﬁle systems, CEP environments, business
process management (BPM) systems, and the like. Open frameworks will be used and developed
under more of a service-oriented architecture (SOA) umbrella, to enable predictive logic that’s
built in any tool to be deployed eventually into any production, transaction, or analytic environment.
It will take at least 3 to 10 years for a really mature interoperability framework to be developed,
for industry to adopt it, and for the interoperability issues to be worked out. It’s critically
important that everybody recognizes that big data, at rest and in motion, needs to be processed by
powerful predictive models that can be deployed into the full range of transactional applications,
which is where the convergence of big data, analytics, and transactions come in.
Data warehouses, as the core of your analytics environment, need to evolve to become in their
own right application servers that can handle both the analytic applications or traditional data
warehousing in BI and data mining, as well as the transactional logic, and really handle it all
with full security and workload isolation, failover, and so forth in a way that’s seamless.
I'm really excited, for example, by what Aster has rolled out with their latest generation, 4.0 of
the Data-Application Server. I see a little bit of progress by Oracle on the Exadata V2. I'm
looking forward to seeing if other vendors follow suit and provide a cloud-based platform for a
broad range of transactional analytics.
Gardner: Sharmila, Jim has painted a very nice picture of where he expects things to go. He
mentioned Aster Data 4.0. Tell us a little bit about that, and where you see the stepping stones ahead.
Mulligan: As I mentioned earlier, one of the biggest requirements in order to be able to do very
advanced analytics on terabyte- and petabyte-level data sets, is to bring the application logic to
the data itself. Earlier, I described why you need to do this. You want to eliminate as much data
movement as possible, and you want to be able to do this analysis in as near real-time as possible.
What we did in Aster Data 4.0 is just that. We're allowing companies to push their analytics
applications inside of Aster’s MPP database, where now you can run your application logic next
to the data itself, so they are both collocated in the same system. By doing so, you've eliminated
all the data movement. What that gives you is very, very quick and efﬁcient access to data, which
is what's required in some of these advanced analytics application examples we talked about.
Pushing the code
What kind of applications can you push down into the system? It can be any app written in
Java, C, C++, Perl, Python, .NET. It could be an existing custom application that an organization
has written and that they need to be able to scale to work on much larger data sets. That code can
be pushed down into the database.
It could be a new application that a customer is looking to write to do a level of analysis that they
could not do before, like real-time fraud analytics, or very deep customer behavior analysis. If
you're trying to deliver these new generations of advanced analytics apps, you would write that
application in the programming language of your choice.
You would push that application down into the Aster system, all your data would live inside of
the Aster MPP database, and the application would run inside of the same system, collocated with the data.
In addition to that, it could be a packaged application -- for example, a software-as-a-service
(SaaS) application that you want to scale to be able to analyze very large data sets. So, you could
push a packaged application inside the system as well.
One of the fundamental things that we leverage to allow you to do more powerful analytics with
these applications is MapReduce. You don't have to MapReduce-enable an application when you
push it down into the Aster system, but you can choose to and, by doing so, you automatically
parallelize the application, which gives you very high performance and scalability when it comes
to accessing large data sets. You also then leverage some of the analytics capabilities of
MapReduce that are not necessarily inherent in something like SQL.
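The per-partition parallelism described here can be sketched in miniature. This is an illustrative Python sketch, not Aster's actual implementation: the map phase emits key-value pairs, each data partition is reduced independently (the step a real engine fans out across nodes), and only the small partial aggregates are merged at the end.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit (key, value) pairs from each input row."""
    for user_id, amount in rows:
        yield user_id, amount

def reduce_phase(pairs):
    """Reduce: aggregate values per key within one partition."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Each partition is processed independently of the others -- this is
# what lets an MPP engine run the same logic on every node in parallel.
partitions = [
    [("u1", 10.0), ("u2", 5.0)],
    [("u1", 2.5), ("u3", 7.0)],
]
partials = [reduce_phase(map_phase(p)) for p in partitions]

# Final merge combines the per-partition results.
result = defaultdict(float)
for part in partials:
    for key, value in part.items():
        result[key] += value
print(dict(result))  # {'u1': 12.5, 'u2': 5.0, 'u3': 7.0}
```

In a real MPP system each partition lives on a different node, so the per-partition work runs in parallel with no bulk data movement; only the compact partial results cross the network.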
The key components of 4.0 add up to a platform that can efficiently and cost-effectively store
massive amounts of data, plus let you do very advanced and sophisticated analytics. To run
through the key things we've done in 4.0: first is the ability to push applications inside the
system, so apps are collocated with the data.
We also offer SQL/MapReduce as the interface. Business analysts who are working with this
application on a regular basis don’t have to learn MapReduce. They can use SQL/MR and
leverage their existing SQL skills to work with that app. So, it makes it very easy for any number
of business analysts in the organization to leverage their preexisting SQL skills and work with
this app that's pushed down into the system.
Finally, in order to support the ability to run applications inside the database, which as I said
earlier is nontrivial, we added fundamental new capabilities like Dynamic Mixed Workload
Management. Workload Management in the Aster system works not just on data queries, but on
the application processes as well, so you can balance workloads when you have a system that's
managing both data and applications.
Kobielus: Sharmila, I think the greatest feature of 4.0 is simply the ability to run predictive
models developed in SAS or other tools in their native code, without necessarily converting them
to SQL/MR. That means your customers can leverage that huge installed pool of intellectual
property, all those models, bring them in, and execute them natively within your distributed grid
or cloud, avoiding the rewrite. Or, if they wish, they can migrate or convert them over to
SQL/MR. It's up to them.
That's a very attractive feature, because fundamentally the data warehousing cloud is an analytic
application server. Essentially, you want that ability to be able to run disparate legacy models in
parallel. That's just a feature that needs to be adopted by the industry as a whole.
The customer decides
Mulligan: Absolutely. I do want to clarify that the Aster 4.0 solution can be deployed in the
cloud, or it can be installed in a standard implementation on-premise, or it could be adopted in an
appliance mode. We support all three. It's up to the customer which of those deployment models
they need or prefer.
To talk in a little bit more detail about what Jim is referring to, the ability to take an existing app,
have to do absolutely no rewrite, and push that application down is, of course, very powerful to
customers. It means that they can immediately take an analytics app they already have and have
it operate on much larger data sets by simply taking that code and pushing it down.
That can be done literally within a day or two. You get the Aster system, you install it, and then,
by the second day, you could be pushing your application down.
If you choose to leverage the MapReduce analytics capabilities, then as I said earlier, you would
MapReduce-enable the app. This simply means you take your existing application and, again, you
don't have to do any rewrite of that logic. You just add MapReduce functions to it and, by doing
so, you have MapReduce-enabled it. Then, you push it down and you have SQL/MR as an
interface to that app.
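As a rough illustration of the "no rewrite" point, here is a hypothetical Python sketch: fraud_score stands in for pre-existing application logic, and the map and reduce wrappers are added around it without touching the function itself.

```python
def fraud_score(txn):
    """Pre-existing application logic -- left completely unchanged."""
    return 1.0 if txn["amount"] > 900 else 0.0

def map_score(rows):
    """Added MapReduce wrapper: apply the untouched function per row."""
    for txn in rows:
        yield txn["account"], fraud_score(txn)

def reduce_max(pairs):
    """Added reducer: keep the highest score seen per account."""
    best = {}
    for account, score in pairs:
        best[account] = max(best.get(account, 0.0), score)
    return best

txns = [
    {"account": "a1", "amount": 950},
    {"account": "a1", "amount": 40},
    {"account": "a2", "amount": 15},
]
scores = reduce_max(map_score(txns))
print(scores)  # {'a1': 1.0, 'a2': 0.0}
```

The original scoring logic is untouched; only the thin map and reduce shells are new, which is why enabling an existing app can plausibly be a matter of days rather than weeks.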
The process of MapReduce-enabling an app also is very simple. It's a couple of days' process.
This is not something that takes weeks and weeks to do. It literally can be done in a couple of
days.
We had a retailer recently who took an existing app that they had already written, a new type of
analytics application that they wanted to deploy. They simply added MapReduce capabilities to it
and pushed it down into the Aster system, and it's now operating on very, very large data sets,
and performing analytics that they weren't able to originally do.
The ease of application push-down and the ease of MapReduce enabling are definitely key to what
we have done in 4.0, and they allow companies to realize the value of this new type of platform.
Gardner: I know it's fairly early in the roll out. Do you have any sense of metrics, from some of
these users? What do they get back? We talked earlier in the examples about what could be done
and what should be done nowadays with analysis. Do you have any sense of what they have been
able to do with 4.0?
Reducing processing times
Mulligan: For example, we have talked about customers like comScore who are processing 1.6
billion rows of data on a regular basis, and their data volumes continue to grow. They have many
business analysts who operate the system and run reports on a daily basis, and they are able to
get results very quickly on a large data set.
We have customers who have gone from 5-10 minute processing times on their data set, to 5
seconds, as a result of putting the application inside of the system.
We have had fraud applications that would take 60-90 minutes to run in the traditional approach,
where the app was running outside the database, and now those applications run in 60-90
seconds.
Literally, by collocating your application logic next to the data itself, you can see that you are
immediately able to go from many minutes of processing time, down to seconds, because you
have eliminated all the data movement altogether. You don’t have to move terabytes of data.
Add to that the fact that you can now access terabyte-sized data sets, versus what customers have
traditionally been left with, which is only the ability to process data sets in the order of several
tens of gigabytes or hundreds of gigabytes. Now, we have telcos, for example, processing four-
or ﬁve-terabyte data sets with very fast response time.
It's the volume of data, the speed, the acceleration, and response time that really provide the
fundamental value here. MapReduce, over and above that, allows you to bring in more analytics
capabilities.
Gardner: A ﬁnal word to you, Jim Kobielus. This really is a good example of how convergence
is taking place at a number of different levels. Maybe you could give us an insight into where
you see convergence happening, and then we'll have to leave it there.
Kobielus: First of all, with convergence the ﬂip side is collision. I just want to point out a few
issues that enterprises and users will have to deal with, as they move toward this best practice
called in-database analytics and convergence of the transactions and analytics.
We're talking about a collision of two cultures, or more than two cultures. Data warehousing
professionals and data mining professionals live in different worlds, as it were. They quite often
have an arm's length relationship to each other. The data warehouse traditionally is a source of
data for advanced analytics.
This new approach will require a convergence, rapprochement, or a dialog to be developed
between these two groups, because ultimately the data warehouse is where the data mining must
live. That's going to have to take place, that coming together of the tribes. That's one of the best
emerging practices that we're recommending to Forrester clients in that area.
Also, transaction systems -- enterprise resource planning (ERP) and customer relationship
management (CRM) -- and analytic systems -- BI and data warehousing -- are again two separate
tribes within your company. You need to bring together these groups to work out a common
framework for convergence to be able to take advantage of this powerful new architecture that
Sharmila has sketched out here.
Much of your transactional logic will continue to live on source systems, the ERP, CRM, supply
chain management, and the like. But, it will behoove you, as an organization, as a user to move
some transactional logic, such as workﬂow, in particular, into the data warehousing cloud to be
driven by real-time analytics and KPIs, metrics, and messages that are generated by inline
models built with MapReduce, and so forth, and pushed down into the warehousing grid or cloud.
Workflow, and especially rules engines, will increasingly be tightly integrated or brought into a
warehousing or analytics cloud that's got inline logic.
Another key trend for convergence is that data mining and text mining are coming together as a
single discipline. When you have structured and unstructured sources of information or you have
unstructured information from new sources like social networks and Twitter, Facebook, and
blogs, it's critically important to bring it together into your data mining environment. A key
convergence also is that data at rest and data in motion are converging, and so a lot of this will be
real-time event processing.
Those are the key convergence and collision avenues that we are looking at going forward.
Gardner: Very good. We've been discussing how new architectures for data and logic processing
are ushering in this game-changing era of advanced analytics. We've been joined by Jim
Kobielus, senior analyst at Forrester Research. Thanks so much, Jim.
Kobielus: No problem. I enjoyed it.
Gardner: Also, we have been talking with Sharmila Mulligan, executive vice president of
marketing at Aster Data. Thank you Sharmila.
Mulligan: Thanks so much, Dana.
Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions. You've been listening
to a sponsored BrieﬁngsDirect podcast. Thanks for listening, and come back next time.
Copyright Interarbor Solutions, LLC, 2005-2010. All rights reserved.
You may also be interested in:
• Aster Data architects application logic with data for speeded-up analytics processing
• Aster targets mid-market with budget-conscious, massively parallel data warehousing
• A technical look at how parallel processing brings vast new capabilities to large-scale