ALSO INSIDE
Do All Roads Lead Back to SQL?
Applying the Lambda Architecture
From the Vault: Easy Real-Time Big Data Analysis Using Storm
www.drdobbs.com
Dr. Dobb’s Journal, November 2013
Really Big Data: Understanding What Big Data Can Deliver
CONTENTS
COVER ARTICLE
8 Understanding What
Big Data Can Deliver
By Aaron Kimball
It’s easy to err by pushing data to fit a projected model. Insights
come, however, from accepting the data’s ability to depict what is
going on, without imposing an a priori bias.
GUEST EDITORIAL
3 Do All Roads Lead Back to SQL?
By Seth Proctor
After distancing themselves from SQL, NoSQL products are moving
towards transactional models as “NewSQL” gains popularity.
What happened?
FEATURES
15 Applying the Big Data Lambda Architecture
By Michael Hausenblas
A look inside a Hadoop-based project that matches connections in
social media by leveraging the highly scalable lambda architecture.
23 From the Vault: Easy Real-Time
Big Data Analysis Using Storm
By Shruthi Kumar and Siddharth Patankar
If you're looking to handle big data and don't want to traverse
the Hadoop universe, you might well find that using Storm is a
simple and elegant solution.
6 News Briefs
By Adrian Bridgwater
Recent news on tools, platforms, frameworks, and the state
of the software development world.
7 Open-Source Dashboard
A compilation of trending open-source projects.
34 Links
Snapshots of interesting items on drdobbs.com including a
look at the first steps to implementing Continuous Delivery
and developing Android apps with Scala and Scaloid.
More on DrDobbs.com
Jolt Awards: The Best Books
Five notable books every serious programmer should read.
http://www.drdobbs.com/240162065
A Massively Parallel Stack for Data Allocation
Dynamic parallelism is an important evolutionary step
in the CUDA software development platform. With it,
developers can perform variable amounts of work
based on divide-and-conquer algorithms and in-memory
data structures such as trees and graphs — entirely
on the GPU without host intervention.
http://www.drdobbs.com/240162018
Introduction to Programming with Lists
What it’s like to program with immutable lists.
http://www.drdobbs.com/240162440
Who Are Software Developers?
Ten years of surveys show an influx of younger developers,
more women, and personality profiles at odds with traditional
stereotypes.
http://www.drdobbs.com/240162014
Java and IoT In Motion
Eric Bruno was involved in the construction of the Internet of
Things (IoT) concept project called “IoT In Motion.” He helped
build some of the back-end components, including a RESTful
service written in Java with some database queries, and helped
a bit with the front-end as well.
http://www.drdobbs.com/240162189
Do All Roads Lead Back to SQL?
After distancing themselves from SQL, NoSQL products are moving towards
transactional models as “NewSQL” gains popularity. What happened?
By Seth Proctor

Much has been made in the past several years about SQL versus NoSQL
and which model is better suited to modern, scale-out deployments. Lost
in many of these arguments is the raison d’être for SQL and the difference
between model and implementation. As new architectures emerge, the
question is why SQL endures and why there is such a renewed interest
in it today.
Background
In 1970, Edgar Codd captured his thoughts on relational logic in a paper
that laid out rules for structuring and querying data (http://is.gd/upAlYi).
A decade later, the Structured Query Language (SQL) began to emerge.
While not entirely faithful to Codd’s original rules, it provided relational
capabilities through a mostly declarative language and helped solve the
problem of how to manage growing quantities of data.
Over the next 30 years, SQL evolved into the canonical data-management
language, thanks largely to the clarity and power of its underlying model
and transactional guarantees. For much of that time, deployments were
dominated by scale-up or “vertical” architectures, in which increased
capacity comes from upgrading to bigger, individual systems. Unsurprisingly,
this is also the design path that most SQL implementations followed.
The term “NoSQL” was coined in 1998 by a database that provided
relational logic but eschewed SQL (http://is.gd/sxH0qy). It wasn’t until
2009 that this term took on its current, non-ACID meaning. By then,
typical deployments had already shifted to scale-out or “horizontal”
models. The perception was that SQL could not provide scale-out
capability, and so new non-SQL programming models gained popularity.
Fast-forward to 2013 and, after a period of decline, SQL is regaining
popularity in the form of NewSQL (http://is.gd/x0c5uu) implementations.
Arguably, SQL never really lost popularity (the market is estimated at
$30 billion and growing), it just went out of style. Either way, this new
generation of systems is stepping back to look at the last 40 years and
understand what that tells us about future design by applying the power
of relational logic to the requirements of scale-out deployments.
Why SQL?
SQL evolved as a language because it solved concrete problems. The
relational model was built on capturing the flow of real-world data. If a
purchase is made, it relates to some customer and product. If a song is
played, it relates to an artist, an album, a genre, and so on. By defining
these relations, programmers know how to work with data, and the system
knows how to optimize queries. Once these relations are defined, then
other uses of the data (audit, governance, etc.) are much easier.
Layered on top of this model are transactions. Transactions are boundaries
guaranteeing the programmer a consistent view of the database, independent
execution relative to other transactions, and clear behavior when two
transactions try to make conflicting changes. That’s the A (atomicity),
C (consistency), and I (isolation) in ACID. To say a transaction has
committed means that these rules were met, and that any changes were
made Durable (the D in ACID). Either everything succeeds or nothing is
changed.
Transactions were introduced as a simplification. They free developers
from having to think about concurrent access, locking, or whether their
changes are recorded. In this model, a multithreaded service can be
programmed as if there were only a single thread. Such programming
simplification is extremely useful on a single server. When scaling across
a distributed environment, it becomes critical.
With these features in place, developers building on SQL were able to be
more productive and focus on their applications. Of particular importance
is consistency. Many NoSQL systems sacrifice consistency for scalability,
putting the burden back on application developers. This trade-off makes
it easier to build a scale-out database, but typically leaves developers
choosing between scale and transactional consistency.
Why Not SQL?
It’s natural to ask why SQL is seen as a mismatch for scale-out architectures,
and there are a few key answers. The first is that traditional SQL
implementations have trouble scaling horizontally. This has led to
approaches like sharding, passive replication, and shared-disk clustering.
The limitations (http://is.gd/SaoHcL) are functions of designing around
direct disk interaction and limited main memory, however, and not
inherent in SQL.
A second issue is structure. Many NoSQL systems tout the benefit of
having no (or a limited) schema. In practice, developers still need some
contract with their data to be effective. It’s flexibility that’s needed: an
easy and efficient way to change structure and types as an application
evolves. The common perception is that SQL cannot provide this flexibility,
but again, this is a function of implementation. When table structure is
tied to on-disk representation, making changes to that structure is very
expensive; whereas nothing in Codd’s logic makes adding or renaming a
column expensive.
Finally, some argue that SQL itself is too complicated a language for
today’s programmers. The arguments on both sides are somewhat
subjective, but the reality is that SQL is a widely used language with a
large community of programmers and a deep base of tools for tasks like
authoring, backup, or analysis. Many NewSQL systems are layering simpler
languages on top of full SQL support to help bridge the gap between
NoSQL and SQL systems. Both have their utility and their uses in modern
environments. To many developers, however, being able to
reuse tools and experience in the context of a scale-out database
means not having to compromise on scale versus consistency.
Where Are We Heading?
The last few years have seen renewed excitement around SQL. NewSQL
systems have emerged that support transactional SQL, built on original
architectures that address scale-out requirements. These systems are
demonstrating that transactions and SQL can scale when built on the right
design. Google, for instance, developed F1 (http://is.gd/Z3UDRU) because
it viewed SQL as the right way to address concurrency, consistency, and
durability requirements. F1 is specific to the Google infrastructure but is
proof that SQL can scale and that the programming model still solves
critical problems in today’s data centers.
Increasingly, NewSQL systems are showing scale, schema flexibility, and
ease of use. Interestingly, many NoSQL and analytic systems are now
putting limited transactional support or richer query languages into their
roadmaps in a move to fill in the gaps around ACID and declarative
programming. What that means for the evolution of these systems is yet
to be seen, but clearly, the appeal of Codd’s model is as strong as ever
43 years later.
— Seth Proctor serves as Chief Technology Officer of NuoDB Inc. and has more than
15 years of experience in the research, design, and implementation of scalable systems.
His previous work includes contributions to the Java security framework, the Solaris
operating system, and several open-source projects.
News Briefs
By Adrian Bridgwater
Progress Pacific PaaS Is A Wider Developer’s PaaS
Progress has used its Progress Exchange 2013 exhibition and developer
conference to announce new features in the Progress Pacific
platform-as-a-service (PaaS) that allow more time and energy to be spent
solving business problems with data-driven applications and less time
worrying about technology and writing code. This is a case of cloud-centric,
data-driven software application development supporting workflows that
are engineered to RealTime Data (RTD) from disparate sources, other SaaS
entities, sensors, and points within the Internet of Things. For developers,
these workflows must be functional for mobile, on-premise, and hybrid
apps where minimal coding is required, such that the programmer is
isolated to a degree from the complexity of middleware, APIs, and drivers.
http://www.drdobbs.com/240162366
New Java Module In SOASTA CloudTest
SOASTA has announced the latest release of CloudTest with a new Java
module to enable developers and testers of Java applications to test any
Java component as they work to “easily scale” it. Direct-to-database testing
here supports Oracle, Microsoft SQL Server, and PostgreSQL databases,
which is important for end-to-end testing by enterprise developers. Also,
additional in-memory processing enhancements make dashboard loading
faster for in-test analytics. New CloudTest capabilities include
Direct-to-Database testing: CloudTest users can now directly test the
scalability of the most popular enterprise and open-source SQL databases
from Oracle, Microsoft SQL Server, and PostgreSQL.
http://www.drdobbs.com/240162292
HBase Apps And The 20 Millisecond Factor
MapR Technologies has updated its M7 edition to improve HBase application
performance with throughput that is 4-10x faster while eliminating latency
spikes. HBase applications can now benefit from MapR's platform to address
one of the major issues for online applications: consistent read latencies
in the “less than 20 millisecond” range across varying workloads.
Differentiated features here include an architecture that persists table
structure at the filesystem layer; no compactions (I/O storms) for HBase
applications; workload-aware splits for HBase applications; direct writes
to disk (vs. writing to an external filesystem); disk and network compression;
and a C++ implementation that does not suffer from the garbage collection
problems seen with Java applications.
http://www.drdobbs.com/240162218
Sauce Labs and Microsoft Whip Up BrowserSwarm
Sauce Labs and Microsoft have partnered to announce BrowserSwarm, a
project to streamline JavaScript testing of Web and mobile apps and
decrease the amount of time developers spend on debugging application
errors. BrowserSwarm is a tool that automates testing of JavaScript across
browsers and mobile devices. It connects directly to a development team's
code repository on GitHub. When the code gets updated, BrowserSwarm
automatically executes a suite of tests using common unit testing
frameworks against a wide array of browser and OS combinations.
BrowserSwarm is powered on the backend by Sauce Labs and allows
developers and QA engineers to automatically test web and mobile apps
across 150+ browser/OS combinations, including iOS, Android, and Mac OS X.
http://www.drdobbs.com/240162298
TOP OPEN-SOURCE PROJECTS
Trending this month on GitHub:
jlukic/Semantic-UI JavaScript
https://github.com/jlukic/Semantic-UI
Creating a shared vocabulary for UI.
HubSpot/pace CSS
https://github.com/HubSpot/pace
Automatic Web page progress bar.
maroslaw/rainyday.js JavaScript
https://github.com/maroslaw/rainyday.js
Simulating raindrops falling on a window.
peachananr/onepage-scroll JavaScript
https://github.com/peachananr/onepage-scroll
Create an Apple-like one page scroller website (iPhone 5S website) with One
Page Scroll plugin.
twbs/bootstrap JavaScript
https://github.com/twbs/bootstrap
Sleek, intuitive, and powerful front-end framework for faster and easier Web
development.
mozilla/togetherjs JavaScript
https://github.com/mozilla/togetherjs
A service for your website that makes it surprisingly easy to collaborate in
real-time.
daviferreira/medium-editor JavaScript
https://github.com/daviferreira/medium-editor
Medium.com WYSIWYG editor clone.
alvarotrigo/fullPage.js JavaScript
https://github.com/alvarotrigo/fullPage.js
fullPage plugin by Alvaro Trigo. Create full-screen pages fast and simple.
angular/angular.js JavaScript
https://github.com/angular/angular.js
Extend HTML vocabulary for your applications.
Trending this month on SourceForge:
Notepad++ Plugin Manager
http://sourceforge.net/projects/npppluginmgr/
The plugin list for Notepad++ Plugin Manager with code for the plugin
manager.
MinGW: Minimalist GNU for Windows
http://sourceforge.net/projects/mingw/
A native Windows port of the GNU Compiler Collection (GCC).
Apache OpenOffice
http://sourceforge.net/projects/openofficeorg.mirror/
An open-source office productivity software suite containing word processor,
spreadsheet, presentation, graphics, formula editor, and database
management applications.
YTD Android
http://sourceforge.net/projects/rahul/
Files Downloader is a free, powerful utility that will help you download your
favorite videos from YouTube. The application is platform-independent.
PortableApps.com
http://sourceforge.net/projects/portableapps/
Popular portable software solution.
Media Player Classic: Home Cinema
http://sourceforge.net/projects/mpc-hc/
This project is based on the original Guliverkli project, and contains additional
features and bug fixes (see complete list on the project’s website).
Anti-Spam SMTP Proxy Server
http://sourceforge.net/projects/assp/
The Anti-Spam SMTP Proxy (ASSP) Server project aims to create an open-
source platform-independent SMTP Proxy server.
Ubuntuzilla:Mozilla Software Installer
http://sourceforge.net/projects/ubuntuzilla/
An APT repository hosting the Mozilla builds of the latest official releases of
Firefox, Thunderbird, and SeaMonkey.
Understanding
What Big Data Can Deliver
It’s easy to err by pushing data to fit a projected model. Insights come, however, from accepting the
data’s ability to depict what is going on, without imposing an a priori bias.
By Aaron Kimball

With all the hype and anti-hype surrounding Big Data, the data management
practitioner is, in an ironic turn of events, inundated with information
about Big Data. It is easy to get lost trying to figure out whether you have
Big Data problems and, if so, how to solve them. It turns out the secret
to taming your Big Data problems is in the detail data. This article explains
how focusing on the details is the most important part of a successful Big
Data project.
Big Data is not a new idea. Gartner coined the term a decade ago, describing
Big Data as data that exhibits three attributes: Volume, Velocity, and
Variety. Industry pundits have been trying to figure out what that means
ever since. Some have even added more “Vs” to try and better explain why
Big Data is something new and different than all the other data that came
before it.
The cadence of commentary on Big Data has quickened to the extent that
if you set up a Google News alert for “Big Data,” you will spend more of
your day reading about Big Data than implementing a Big Data solution.
What the analysts gloss over and the vendors attempt to simplify is that
Big Data is primarily a function of digging into the details of the data you
already have.
Gartner might have coined the term “Big Data,” but they did not invent
the concept. Big Data was just rarer then than it is today. Many companies
have been managing Big Data for ten years or more. These companies
may not have had the efficiencies of scale that we benefit from currently,
yet they were certainly paying attention to the details of their data and
storing as much of it as they could afford.
A Brief History of Data Management
Data management has always been a balancing act between the volume
of data and our capacity to store, process, and understand it.
The biggest achievement of the On Line Analytic Processing (OLAP) era
was to give users interactive access to data, which was summarized across
multiple dimensions. OLAP systems spent a significant amount of time up
front to pre-calculate a wide variety of aggregations over a data set that
could not otherwise be queried interactively. The output was called a
“cube” and was typically stored in memory, giving end users the ability
to ask any question that had a pre-computed answer and get results in
less than a second.
Big Data is exploding as we enter the era of plenty: high bandwidth, greater
storage capacity, and many processor cores. New software, written after
these systems became available, is different than its forebears. Instead of
highly tuned, high-priced systems that optimize for the minimum amount
of data required to answer a question, the new software captures as much
data as possible in order to answer as-yet-undefined queries. With this
new data captured and stored, there are a lot of details that were previously
unseen.
Why More Data Beats Better Algorithms
Before I get into how detail data is used, it is crucial to understand at the
algorithmic level the signal importance of detail data. Since the former
Director of Technology at Amazon.com, Anand Rajaraman, first expounded
the concept that “more data beats better algorithms,” his claim has been
supported and attacked many times. The truth behind his assertion is
rather subtle. To really understand it, we need to be more specific about
what Rajaraman said, then explain in a simple example how it works.
Experienced statisticians understand that having more training data can
improve the accuracy of and confidence in a model. For example, say we
believe that the relationship between two variables, such as the number
of pages viewed on a website and the percent likelihood to make a purchase,
is linear. Having more data points would improve our estimate of the
underlying linear relationship. Compare the graphs in Figures 1 and 2,
showing that more data will give us a more accurate and confident
estimation of the linear relationship.
Figure 1: Using little data to estimate a relationship.
Figure 2: The same relationship with more data.
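To make this concrete, here is a small Python sketch in the spirit of Figures 1
and 2; the data is synthetic and the true slope of 0.8 is an arbitrary assumption,
not something taken from the article's figures:

import numpy as np

rng = np.random.default_rng(42)

def fit_line(n_points):
    # Synthetic data: page views vs. purchase likelihood with a truly linear trend plus noise.
    pages = rng.uniform(1, 50, n_points)
    likelihood = 0.8 * pages + rng.normal(0, 8, n_points)
    slope, intercept = np.polyfit(pages, likelihood, deg=1)
    return slope, intercept

print("10 points:   slope=%.2f  intercept=%.2f" % fit_line(10))
print("1000 points: slope=%.2f  intercept=%.2f" % fit_line(1000))
# With more samples the estimate settles near the true slope (0.8),
# mirroring the tighter fit of Figure 2 versus Figure 1.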
A statistician would also be quick to point out that we cannot increase
the effectiveness of this pre-selected model by adding even more data.
Adding another 100 data points to Figure 2, for example, would not greatly
improve the accuracy of the model. The marginal
benefit of adding more training data in this case decreases quickly. Given
this example, we could argue that having more data does not always beat
more-sophisticated algorithms at predicting the expected outcome. To
increase accuracy as we add data, we would need to change our model.
The “trick” to effectively using more data is to make fewer initial assumptions
about the underlying model and let the data guide which model is most
appropriate. In Figure 1, we assumed the linear model after collecting very
little data about the relationship between page views and propensity to
purchase. As we will see, if we deploy our linear model, which was built
on a small sample of data, onto a large data set, we will not get very
accurate estimates. If instead we are not constrained by data collection,
we could collect and plot all of the data before committing to any simplifying
assumptions. In Figure 3, we see that additional data reveals a more
complex clustering of data points.
By making a few weak (that is, tentative) assumptions, we can evaluate
alternative models. For example, we can use a density estimation technique
instead of using the linear parametric model, or use other techniques.
With an order of magnitude more data, we might see that the true
relationship is not linear. For example, representing our model as a
histogram as in Figure 4 would produce a much better picture of the
underlying relationship.
Figure 3: Even more data shows a different relationship.
Figure 4: The data in Figure 3 represented as a histogram.
Linear regression does not predict the relationship between the variables
accurately because we have already made too strong an assumption that
does not allow for additional unique features in the data to be
captured, such as the U-shaped dip between 20 and 30 on the x-axis.
With this much data, using a histogram results in a very accurate model.
Detail data allows us to pick a nonparametric model, such as estimating
a distribution with a histogram, and gives us more confidence that we are
building an accurate model.
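A rough sketch of the same idea: fit a straight line and a histogram of per-bin
averages to the same synthetic data. The dip between 20 and 30 mirrors the one
described above, but all numbers here are invented:

import numpy as np

rng = np.random.default_rng(7)

# Synthetic, non-linear relationship: likelihood dips between 20 and 30 page views.
pages = rng.uniform(1, 50, 5000)
likelihood = 30 + 0.5 * pages - 15 * ((pages > 20) & (pages < 30)) + rng.normal(0, 3, 5000)

# Parametric model: a straight line fitted to all the data.
slope, intercept = np.polyfit(pages, likelihood, deg=1)

# Nonparametric model: a histogram of per-bin averages, letting the data set the shape.
bins = np.linspace(1, 50, 25)
which_bin = np.digitize(pages, bins)
bin_means = np.array([likelihood[which_bin == b].mean() for b in range(1, len(bins))])

x = 25.0  # a point inside the dip
linear_estimate = slope * x + intercept
histogram_estimate = bin_means[np.digitize(x, bins) - 1]
print("linear: %.1f  histogram: %.1f" % (linear_estimate, histogram_estimate))
# The histogram follows the dip; the straight line averages it away.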
If this were a much larger parameter space, the model itself, represented
by just the histogram, could be very large. Using nonparametric models
is common in Big Data analysis because detail data allows us to let the
data guide our model selection, especially when the model is too large
to fit in memory on a single machine. Some examples include item
similarity matrices for millions of products and association rules derived
using collaborative filtering techniques.
One Model to Rule Them All
The example in Figures 1 through 4 demonstrates a two-dimensional model
mapping the number of pages a customer views on a website to the percent
likelihood that the customer will make a purchase. It may be the case that
one type of customer, say a homemaker looking for the right style of throw
pillow, is more likely to make a purchase the more pages they view. Another
type of customer, for example an amateur contractor, may only view a lot
of pages when doing research. Contractors might be more likely to make
a purchase when they go directly to the product they know they want.
Introducing additional dimensions can dramatically complicate the model;
and maintaining a single model can create an overly generalized estimation.
Customer segmentation can be used to increase the accuracy of a model
while keeping complexity under control. By using additional data to first
identify which model to apply, it is possible to introduce additional
dimensions and derive more-accurate estimations. In this example, by
looking at the first product that a customer searches for, we can select a
different model to apply based on our prediction of which segment of the
population the customer falls into. We use a different model for segmentation
based on data that is related yet distinct from the data we use for the
model that predicts how likely the customer is to make a purchase. First,
we consider a specific product that they look at and then we consider the
number of pages they visit.
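The selection logic itself can be very small. The Python sketch below is purely
illustrative; the segments, keywords, and likelihood functions are invented and
stand in for models trained on real data:

# Per-segment model selection: pick a model from the first product searched for,
# then apply it to the page-view count. Everything here is hypothetical.
SEGMENT_MODELS = {
    "homemaker":  lambda pages_viewed: min(1.0, 0.03 * pages_viewed),          # more browsing, more likely to buy
    "contractor": lambda pages_viewed: max(0.05, 0.6 - 0.02 * pages_viewed),   # long research sessions convert less
}

def pick_segment(first_search_term):
    """Choose which model to apply based on the first product searched for."""
    term = first_search_term.lower()
    if "pillow" in term:
        return "homemaker"
    if "lumber" in term:
        return "contractor"
    return "homemaker"  # default segment

def purchase_likelihood(first_search_term, pages_viewed):
    segment = pick_segment(first_search_term)
    return SEGMENT_MODELS[segment](pages_viewed)

print(purchase_likelihood("throw pillow", 30))  # homemaker model: 0.9
print(purchase_likelihood("2x4 lumber", 30))    # contractor model: 0.05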
Demographics and Segmentation No Longer Are Sufficient
Applications that focus on identifying categories of users are built with
user segmentation systems. Historically, user segmentation was based on
demographic information. For example, a customer might have been
identified as a male between the ages of 25-34 with an annual household
income of $100,000-$150,000 and living in a particular county or zip code.
As a means of powering advertising channels such as television, radio,
newspapers, or direct mailings, this level of detail was sufficient. Each
media outlet would survey its listeners or readers to identify the
demographics for a particular piece of syndicated content and advertisers
could pick a spot based on the audience segment.
With the evolution of online advertising and Internet-based media,
segmentation started to become more refined. Instead of a dozen
demographic attributes, publishers were able to get much more specific
about a customer's profile. For example, based on Internet browsing habits,
retailers could tell whether a customer lived alone, was in a relationship,
traveled regularly, and so on. All this information was available previously
but it was difficult to collate. By instrumenting customer website browsing
behavior and correlating this data with purchases, retailers could fine-tune
their segmenting algorithms and create ads targeted to specific types of
customers.
Today, nearly every Web page a user views is connected directly to an
advertising network. These ad networks connect to ad exchanges to find
bidders for the screen real estate of the user's Web browser. Ad exchanges
operate like stock exchanges except that each bid slot is for a one-time
ad to a specific user. The exchange uses the user's profile information or
their browser cookies to convey the customer segment of the user.
Advertisers work with specialized digital marketing firms whose algorithms
try to match the potential viewer of an advertisement with the available
ad inventory and bid appropriately.
Real-Time Updating of Data Matters (People Aren’t Static)
Segmentation data used to change rarely, with one segmentation map
reflecting the profile of a particular audience for months at a time; today,
segmentation can be updated throughout the day as customers' profiles
change. Using the same information gleaned from user behavior that
assigns a customer's initial segment group, organizations can update a
customer's segment on a click-by-click basis. Each action better informs
the segmentation model and is used to identify what information to
present next.
The process of constantly re-evaluating customer segmentation has enabled
new dynamic applications that were previously impossible in the offline
world. For example, when a model results in an incorrect segmentation
assignment, new data based on customer actions can be used to update
the model. If presenting the homemaker with a power tool prompts the
homemaker to go back to the search bar, the segmentation results are
probably mistaken. As details about a customer emerge, the model's results
become more accurate. A customer that the model initially predicted was
an amateur contractor looking at large quantities of lumber may in fact
be a professional contractor.
By constantly collecting new data and re-evaluating the models, online
applications can tailor the experience to precisely what a customer is
looking for. Over longer periods of time, models can take into account
new data and adjust based on larger trends. For example, a stereotypical
life trajectory involves entering into a long-term relationship, getting
engaged, getting married, having children, and moving to the suburbs. At
each stage in life, and in particular during the transitions, one's segment
group changes. By collecting detailed data about online behaviors and
constantly reassessing the segmentation model, these life transitions are
automatically incorporated into the user's application experience.
Instrument Everything
We've shown examples of how detail data can be used to pick better
models, which result in more accurate predictions. And I have explained
how models built on detail data can be used to create better application
experiences and adapt more quickly to changes in customer behavior. If
you've become a believer in the power of detail data and you're not already
drowning in it, you likely want to know how to get some.
It is often said that the only way to get better at something is to measure
it. This is true of customer engagement as well. By recording the details
of an application, organizations can effectively recreate the flow of
interaction. This includes not just the record of purchases, but a record
of each page view, every search query or selected category, and the details
of all items that a customer viewed. Imagine a store clerk taking notes as
a customer browses and shops or asks for assistance. All of these actions
can be captured automatically when the interaction is digital.
Instrumentation can be accomplished in two ways. Most modern Web and
application servers record logs of their activity to assist with operations
and troubleshooting. By processing these logs, it is possible to extract the
relevant information about user interactions with an application. A more
direct method of instrumentation is to explicitly record actions taken by
an application into a database. When the application, running in an
application server, receives a request to display all the throw pillows in
the catalog, it records this request and associates it with the current user.

Automatic Data Collection
Some data is already collected automatically. Every Web server records
details about the information requested by the customer's Web browser.
While not well organized or obviously usable, this information often
includes sufficient detail to reconstruct a customer's session. The log
records include timestamps, session identifiers, client IP address, and the
request URL including the query string. If this data is combined with a
session table, a geo-IP database, and a product catalog, it is possible to
fairly accurately reconstruct the customer's browsing experience.
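As a minimal sketch of what such log processing can look like, the snippet below
reconstructs page views from Web server log lines. The log format (common log
format with a session cookie appended) and all field names are assumptions about
a hypothetical server configuration, not a description of any particular product:

import re
from urllib.parse import urlparse, parse_qs

# One line per request: client IP, timestamp, request, status, size, session cookie.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<url>\S+) HTTP/1\.\d" \d+ \d+ "session=(?P<session>\w+)"'
)

def parse_line(line):
    m = LOG_LINE.match(line)
    if not m:
        return None
    url = urlparse(m.group("url"))
    return {
        "ip": m.group("ip"),
        "timestamp": m.group("ts"),
        "session": m.group("session"),
        "path": url.path,
        "query": parse_qs(url.query),
    }

sample = '203.0.113.9 - - [12/Nov/2013:10:02:11 -0800] "GET /catalog/pillows?color=blue HTTP/1.1" 200 5120 "session=ab12cd"'
print(parse_line(sample))  # one reconstructed page view, ready to group by session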
Test Constantly
The result of collecting detail data, building more accurate models, and
refining customer segments is a lot of variability in what gets shown to a
particular customer. As with any model-based system, past performance
is not necessarily indicative of future results. The relationships between
variables change, customer behavior changes, and of course reference
data such as product catalogs change. In order to know whether a model
is producing results that help drive customers to success, organizations
must test and compare multiple models.
A/B testing is used to compare the performance of a fixed number of
experiments over a set amount of time. For example, when deciding which
of several versions of an image of a pillow a customer is most likely to
click on, you can select a subset of customers to show one image or
another. What A/B testing does not capture is the reason behind a result.
It may be by chance that a high percentage of customers who saw version
A of the pillow were not looking for pillows at all and would not have
clicked on version B either.
An alternative to A/B testing is a class of techniques called Bandit
algorithms. Bandit algorithms use the results of multiple models and
constantly evaluate which experiment to run. Experiments that perform
better (for any reason) are shown more often. The result is that experiments
can be run constantly and measured against the data collected for each
experiment. The combinations do not need to be predetermined and the
more successful experiments automatically get more exposure.
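Here is a minimal epsilon-greedy Python sketch of that idea. The variants and
their click-through rates are invented, and production systems typically use more
careful statistics, but the shape is the same: better performers get more traffic
while every variant keeps being explored.

import random

VARIANTS = ["pillow_image_A", "pillow_image_B", "pillow_image_C"]
TRUE_CLICK_RATE = {"pillow_image_A": 0.04, "pillow_image_B": 0.07, "pillow_image_C": 0.05}

shows = {v: 0 for v in VARIANTS}
clicks = {v: 0 for v in VARIANTS}

def choose_variant(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(VARIANTS)  # explore: keep every variant in play
    # exploit: pick the variant with the best smoothed click-through rate so far
    return max(VARIANTS, key=lambda v: (clicks[v] + 1) / (shows[v] + 2))

for _ in range(10000):
    v = choose_variant()
    shows[v] += 1
    if random.random() < TRUE_CLICK_RATE[v]:  # simulated customer response
        clicks[v] += 1

print({v: shows[v] for v in VARIANTS})  # the strongest variant ends up shown most often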
Conclusion
Big Data has seen a lot of hype in recent years, yet it remains unclear to
most practitioners where they need to focus their time and attention. Big
Data is, in large part, about paying attention to the details in a data set.
The techniques available historically have been limited to the level of
detail that the hardware available at the time could process. Recent
developments in hardware capabilities have led to new software that makes
it cost effective to store all of an organization's detail data. As a result,
organizations have developed new techniques around model selection,
segmentation, and experimentation. To get started with Big Data, instrument
your organization's applications, start paying attention to the details, let
the data inform the models, and test everything.
— Aaron Kimball founded WibiData in 2010 and is the Chief Architect for the Kiji
project. He has worked with Hadoop since 2007 and is a committer on the Apache Hadoop
project. In addition, Aaron founded Apache Sqoop, which connects Hadoop to relational
databases, and Apache MRUnit for testing Hadoop projects.
Applying the Big Data
Lambda Architecture
A look inside a Hadoop-based project that matches connections in social media by leveraging
the highly scalable lambda architecture.
By Michael Hausenblas

Based on his experience working on distributed data processing systems
at Twitter, Nathan Marz recently designed a generic architecture addressing
common requirements, which he called the Lambda Architecture. Marz is
well-known in Big Data: He's the driving force behind Storm (see page 24)
and at Twitter he led the streaming compute team, which provides and
develops shared infrastructure to support critical real-time applications.
Marz and his team described the underlying motivation for building
systems with the lambda architecture as:
• The need for a robust system that is fault-tolerant, both against
hardware failures and human mistakes.
• To serve a wide range of workloads and use cases, in which low-
latency reads and updates are required. Related to this point, the
system should support ad-hoc queries.
• The system should be linearly scalable, and it should scale out
rather than up, meaning that throwing more machines at the
problem will do the job.
• The system should be extensible so that features can be added
easily, and it should be easily debuggable and require minimal
maintenance.
From a bird's-eye view, the lambda architecture has three major components
that interact with new data coming in and respond to queries, which in
this article are driven from the command line:
Figure 1: Overview of the lambda architecture.
Essentially, the Lambda Architecture comprises the following components,
processes, and responsibilities:
• New Data: All data entering the system is dispatched to both the
batch layer and the speed layer for processing.
• Batch layer: This layer has two functions: (i) managing the master
dataset, an immutable, append-only set of raw data, and (ii)
pre-computing arbitrary query functions, called batch views. Hadoop's
HDFS (http://is.gd/Emgj57) is typically used to store the master dataset
and perform the computation of the batch views using MapReduce
(http://is.gd/StjZaI).
• Serving layer: This layer indexes the batch views so that they can be
queried ad hoc with low latency. To implement the serving layer,
usually technologies such as Apache HBase (http://is.gd/2ro9CY) or
ElephantDB (http://is.gd/KgIZ2G) are utilized. The Apache Drill project
(http://is.gd/wB1IYy) provides the capability to execute full ANSI SQL
2003 queries against batch views.
• Speed layer: This layer compensates for the high latency of updates
to the serving layer, due to the batch layer. Using fast and incremental
algorithms, the speed layer deals with recent data only. Storm
(http://is.gd/qP7fkZ) is often used to implement this layer.
• Queries: Last but not least, any incoming query can be answered by
merging results from batch views and real-time views, as the sketch
below illustrates.
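The merge step can be illustrated with a toy Python sketch; the per-user counts
below are invented, and the views are plain dictionaries standing in for HBase
tables and speed-layer state:

# Batch view: stale but complete. Real-time view: fresh but covers only recent data.
batch_view = {"alice": 120, "bob": 45}    # computed by the batch layer, hours old
realtime_view = {"alice": 3, "carol": 1}  # maintained by the speed layer since the last batch run

def query(user):
    # Answer a query by combining both views.
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

print(query("alice"))  # 123: batch result plus recent increments
print(query("carol"))  # 1: only seen by the speed layer so far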
Scope and Architecture of the Project
In this article, I employ the lambda architecture to implement what I call
UberSocialNet (USN). This open-source project enables users to store and
query acquaintanceship data. That is, I want to be able to capture whether
I happen to know someone from multiple social networks, such as Twitter
or LinkedIn, or from real-life circumstances. The aim is to scale out to
several billions of users while providing low-latency access to the stored
information. To keep the system simple and comprehensible, I limit myself
to bulk import of the data (no capabilities to live-stream data from social
networks) and provide only a very simple command-line user interface.
The guts, however, use the lambda architecture.
It's easiest to think about USN in terms of two orthogonal phases:
• Build-time, which includes the data pre-processing, generating the
master dataset, as well as creating the batch views.
• Runtime, in which the data is actually used, primarily via issuing
queries against the data space.
The USN app architecture is shown below in Figure 2:
Figure 2: High-level architecture diagram of the USN app.
The following subsystems and processes, in line with the lambda architecture,
are at work in USN:
• Data pre-processing. Strictly speaking, this can be considered part of
the batch layer. It can also be seen as an independent process necessary
to bring the data into a shape that is suitable for the master dataset
generation.
• The batch layer. Here, a bash shell script (http://is.gd/smhcl6) is used
to drive a number of HiveQL (http://is.gd/8qSOSF) queries (see the
GitHub repo, in the batch-layer folder at http://is.gd/QDU6pH) that
are responsible for loading the pre-processed input CSV data into HDFS.
• The serving layer. In this layer, we use a Python script
(http://is.gd/Qzklmw) that loads the data from HDFS via Hive and
inserts it into an HBase table, thus creating a batch view of the data.
This layer also provides query capabilities, necessary in the runtime
phase to serve the front-end.
• Command-line front-end. The USN app front-end is a bash shell script
(http://is.gd/nFZoqB) interacting with the end user and providing
operations such as listings, lookups, and search.
This is all there is from an architectural point of view. You may have noticed
that there is no speed layer in USN as of now. This is due to the scope I
initially introduced above. At the end of this article, I'll revisit this topic.
The USN App Technology Stack and Data
Recently, Dr. Dobb's discussed Pydoop: Writing Hadoop Programs in Python
(http://www.drdobbs.com/240156473), which will serve as a gentle
introduction to setting up and using Hadoop with Python. I'm going to use
a mixture of Python and bash shell scripts to implement the USN. However,
I won't rely on the low-level MapReduce API provided by Pydoop, but rather
on higher-level libraries that interface with Hive and HBase, which are part
of Hadoop. Note that the entire source code, including the test data and
all queries as well as the front-end, is available in a GitHub repository
(http://is.gd/XFI4wY), and you will need it to follow along with this
implementation.
Before I go into the technical details such as the concrete technology stack
used, let's have a quick look at the data transformation happening between
the batch and the serving layer (Figure 3).
Figure 3: Data transformation from batch to serving layer in the USN app.
As hinted in Figure 3, the master dataset (left) is a collection of atomic
actions: either a user has added someone to their networks, or the reverse
has taken place and a person has been removed from a network. This form
of the data is as raw as it gets in the context of our USN app and can serve
as the basis for a variety of views that are able to answer different sorts
of queries. For simplicity's sake, I only consider one possible view that is
used in the USN app front-end: the "network-friends" view, per user, shown
in the right part of Figure 3.
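That fold from raw actions into a per-user view is easy to picture in a few lines
of Python. The sample records below are invented and, unlike the project's HiveQL
view (which keeps only ADD rows), this toy sketch also replays REMOVEs to show why
the raw, append-only actions are worth keeping:

from collections import defaultdict

actions = [
    ("2012-03-12T22:54:13-07:00", "Michael", "ADD",    "I", "Ora Hatfield"),
    ("2012-04-01T09:10:00-07:00", "Michael", "ADD",    "T", "Ted"),
    ("2012-11-23T01:53:42-08:00", "Michael", "REMOVE", "I", "Ora Hatfield"),
]

view = defaultdict(set)  # (user, network) -> set of friends
for _ts, user, action, network, friend in sorted(actions):  # replay in time order
    if action == "ADD":
        view[(user, network)].add(friend)
    else:
        view[(user, network)].discard(friend)

print(dict(view))  # {('Michael', 'I'): set(), ('Michael', 'T'): {'Ted'}}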
Raw Input Data
The raw input data is a Comma Separated Value (CSV) file with the
following format:
timestamp,originator,action,network,target,context
2012-03-12T22:54:13-07:00,Michael,ADD,I,Ora Hatfield, bla
2012-11-23T01:53:42-08:00,Ted,REMOVE,I,Marvin Garrison, meh
...
The raw CSV file contains the following six columns:
• timestamp is an ISO 8601 formatted date-time stamp that states when
the action was performed (range: January 2012 to May 2013).
• originator is the name of the person who added or removed a person
to or from one of his or her networks.
• action must be either ADD or REMOVE and designates the action that
has been carried out. That is, it indicates whether a person has been
added to or removed from the respective network.
• network is a single character indicating the respective network where
the action has been performed. The possible values are: I, in-real-life;
T, Twitter; L, LinkedIn; F, Facebook; G, Google+.
• target is the name of the person added to or removed from the
network.
• context is a free-text comment, providing a hint why the person has
been added/removed or where one has met the person in the first
place.
There are no optional fields in the dataset. In other words: Each row is
completely filled. In order to generate some test data to be used in the
USN app, I've created a raw input CSV file from generatedata.com in five
runs, yielding some 500 rows of raw data.
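The pre-processing step that appears later in the article (usn-preprocess.sh) is
a bash script and is not reproduced here. A hedged Python equivalent might simply
turn this comma-separated export into the pipe-delimited form that the Hive LOAD
statement expects; whether the real script also drops the header row or trims
whitespace is my assumption:

import csv

with open("usn-raw-data.csv", newline="") as raw, open("usn-base-data.csv", "w", newline="") as base:
    reader = csv.reader(raw)
    writer = csv.writer(base, delimiter="|")
    next(reader)  # skip the header row: timestamp,originator,action,network,target,context
    for row in reader:
        writer.writerow(cell.strip() for cell in row)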
Technology Stack
USN uses several software frameworks, libraries, and components, as I
mentioned earlier. I've tested it with:
• Apache Hadoop 1.0.4 (http://is.gd/4suWof)
• Apache Hive 0.10.0 (http://is.gd/tOfbsP)
• Hiver for Hive access from Python (http://is.gd/OXujzB)
• Apache HBase 0.94.4 (http://is.gd/7VnBqR)
• HappyBase for HBase access from Python (http://is.gd/BuJzaH)
I assume that you're familiar with the bash shell and have Python 2.7 or
above installed. I've tested the USN app under Mac OS X 10.8, but there
are no hard dependencies on any Mac OS X-specific features, so it should
run unchanged under any Linux environment.
Building the USN Data Space
The first step is to build the data space for the USN app, that is, the master
dataset and the batch view. Then we will have a closer look behind the
scenes of each of the commands.
First, some pre-processing of the raw data generated earlier:
$ pwd
/Users/mhausenblas2/Documents/repos/usn-app/data
$ ./usn-preprocess.sh < usn-raw-data.csv > usn-base-data.csv
Next, we want to build the batch layer. For this, I first need to make
sure that the Hive Thrift service is running:
$ pwd
/Users/mhausenblas2/Documents/repos/usn-app/batch-layer
$ hive --service hiveserver
Starting Hive Thrift Server
...
Now, I can run the script that executes the Hive queries and builds our
USN app master dataset, like so:
$ pwd
/Users/mhausenblas2/Documents/repos/usn-app/batch-layer
$ ./batch-layer.sh INIT
USN batch layer created.
$ ./batch-layer.sh CHECK
The USN batch layer seems OK.
This generates the batch layer, which is in HDFS. Next, I create the serving
layer in HBase by building a view of the relationships to people. For this,
both the Hive and HBase Thrift services need to be running. Below, you
see how to start the HBase Thrift service:
$ echo $HBASE_HOME
/Users/mhausenblas2/bin/hbase-0.94.4
$ cd /Users/mhausenblas2/bin/hbase-0.94.4
$ ./bin/start-hbase.sh
starting master, logging to /Users/...
$ ./bin/hbase thrift start -p 9191
13/05/31 09:39:09 INFO util.VersionInfo: HBase 0.94.4
Now that both the Hive and HBase Thrift services are up and running, I
can run the serving-layer script from the GitHub repository (in the
respective directory, wherever you've unzipped or cloned it) to build the
batch view in HBase.
Now, let's have a closer look at what is happening behind the scenes of
each of the layers in the next sections.
The Batch Layer
The raw data is first pre-processed and loaded into Hive. In Hive (remember,
this constitutes the master dataset in the batch layer of our USN app) the
following schema is used:
CREATE TABLE usn_base (
  actiontime STRING,
  originator STRING,
  action STRING,
  network STRING,
  target STRING,
  context STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
To import the CSV data and build the master dataset, the shell script
batch-layer.sh executes the following HiveQL commands:
LOAD DATA LOCAL INPATH '../data/usn-base-data.csv' INTO TABLE usn_base;
DROP TABLE IF EXISTS usn_friends;
CREATE TABLE usn_friends AS
SELECT actiontime, originator AS username, network,
       target AS friend, context AS note
FROM usn_base
WHERE action = 'ADD'
ORDER BY username, network;
With this, the USN app master dataset is ready and available in HDFS, and
I can move on to the next layer, the serving layer.
The Serving Layer of the USN App
The batch view used in the USN app is realized via an HBase table called
usn_friends. This table is then used to drive the USN app front-end; it has
the schema shown in Figure 4.
Figure 4: HBase schema used in the serving layer of the USN app.
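For a feel of how such a batch view can be populated from Python, here is a hedged
sketch using HappyBase. The real serving-layer script reads the rows out of Hive;
to stay self-contained, this version reads the pre-processed CSV instead. The
column family 'a', the qualifier names, and the row-key scheme are my assumptions
rather than the project's actual layout, and the usn_friends table is assumed to
already exist:

import csv
import happybase

connection = happybase.Connection("localhost", port=9191)  # the HBase Thrift service started above
table = connection.table("usn_friends")

with open("usn-base-data.csv", newline="") as f:
    for actiontime, username, action, network, friend, note in csv.reader(f, delimiter="|"):
        if action != "ADD":
            continue  # the batch view only keeps acquaintances that were added
        row_key = "%s-%s-%s" % (username, network, friend)
        table.put(row_key, {
            "a:actiontime": actiontime,
            "a:network": network,
            "a:friend": friend,
            "a:note": note,
        })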
After building the serving layer, I can use the HBase shell to verify if
the batch view has been properly populated in the respective table
usn_friends:
$ ./bin/hbase shell
hbase(main):001:0> describe 'usn_friends'
...
{NAME => 'usn_friends', FAMILIES => [{NAME => 'a',
DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', VERSIONS => '3',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '-1',
KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
IN_MEMORY => 'false', ENCODE_ON_DISK => 'true',
BLOCKCACHE => 'false'}]} true
1 row(s) in 0.2450 seconds
You can have a look at some more queries used in the demo user interface
on the Wiki page of the GitHub repository (http://is.gd/7v0IXz).
Putting It All Together
After the batch and serving layers have been initialized and launched, as
described, you can launch the user interface. To use the CLI, make sure
that HBase and the HBase Thrift service are running and then, in the main
USN app directory, run:
$ ./usn-ui.sh
This is USN v0.0
u ... user listings, n ... network listings, l ... lookup,
s ... search, h ... help, q ... quit
Figure 5 shows a screenshot of the USN app front-end in action. The main
operations the USN front-end provides are as follows:
• u ... user listing lists all acquaintances of a user
• n ... network listing lists acquaintances of a user in a network
• l ... lookup listing lists acquaintances of a user in a network and
allows restrictions on the time range (from/to) of the acquaintanceship
• s ... search provides search for an acquaintance over all users,
allowing for partial match
An example USN app front-end session is available at the GitHub
repo (http://is.gd/c3i6FW) for you to study.
What’s Next?
I have intentionally kept USN simple. Although fully functional, it has several
intentional limitations (due to space restrictions here). I can suggest several
improvements you could have a go at, using the available code base
(http://is.gd/XFI4wY) as a starting point.
• Bigger data: The most obvious point is not the app itself but the data
size. Only a laughable 500 rows? This isn't Big Data, I hear you say.
Rightly so. But no one stops you from generating 500 million rows or
more and trying it out. Certain processes, such as the pre-processing
and the generation of the layers, will take longer, but no architectural
changes are necessary, and this is the whole point of the USN app.
• Creating a full-blown batch layer: Currently, the batch layer is a sort
of one-shot, while it should really run in a loop and append new data.
This requires partitioning of the ingested data and some checks. Pail
(http://is.gd/sJAKGN), for example, allows you to do the ingestion and
partitioning in a very elegant way.
• Adding a speed layer and automated import: It would be interesting
to automate the import of data from the various social networks. For
example, Google Takeout (http://is.gd/Zy0HcB) allows exporting all
data in bulk mode, including G+ Circles. For a stab at the speed layer,
one could try to utilize the Twitter fire-hose (http://is.gd/xVroGO)
along with Storm.
• More batch views: There is currently only one view (friend list per
network, per user) in the serving layer. The USN app might benefit
from different views to enable different queries most efficiently, such
as time-series views of network growth or overlaps of acquaintanceships
across networks.
Figure 5: Screenshot of the USN app command-line user interface.
I hope you have as much fun playing around with the USN app and
extending it as I had writing it in the first place. I’d love to hear back
from you on ideas or further improvements either directly here as a
comment or via the GitHub issue tracker of the USN app repository.
Further Resources
• A must-read for the Lambda Architecture is the Big Data book by
Nathan Marz and James Warren from Manning (http://is.gd/lPtVJS).
The USN app idea actually stems from one of the examples used in
this book.
• Slide deck on a real-time architecture using Hadoop and Storm
(http://is.gd/nz0wD6) from FOSDEM 2013.
• A blog post about an example “lambda architecture” for real-time
analysis of hashtags using Trident, Hadoop, and Splout SQL
(http://is.gd/ZTJarF).
• Additional batch layer technologies, such as Pail (http://is.gd/sJAKGN)
for managing the master dataset and JCascalog (http://is.gd/i7jf1W)
for creating the batch views.
• Apache Drill (http://is.gd/wB1IYy) for providing interactive, ad-hoc
queries against HDFS, HBase, or other NoSQL back-ends.
• Additional speed layer technologies, such as Trident (http://is.gd/Bxqt9j),
a high-level abstraction for doing real-time computing on top of Storm,
and MapR's Direct Access NFS (http://is.gd/BaoE0l) to land data directly
from streaming sources such as social media streams or sensor devices.
— Michael Hausenblas is the Chief Data Engineer EMEA, MapR Technologies.
From the Vault: Easy, Real-Time Big Data Analysis Using Storm
Conceptually straightforward and easy to work with, Storm makes handling big data analysis a breeze.
By Shruthi Kumar and Siddharth Patankar
Today, companies regularly generate terabytes of data in their daily
operations. The sources include everything from data captured from network
sensors, to the Web, social media, transactional business data, and data
created in other business contexts. Given the volume of data being
generated, real-time computation has become a major challenge faced by
many organizations. A scalable real-time computation system that we have
used effectively is the open-source Storm tool, which was developed at
Twitter and is sometimes referred to as “real-time Hadoop.” However, Storm
(http://storm-project.net/) is far simpler to use than Hadoop in that it
does not require mastering an alternate universe of new technologies
simply to handle big data jobs.
This article explains how to use Storm. The example project, called “Speeding
Alert System,” analyzes real-time data, raises a trigger, and stores the
relevant data in a database when the speed of a vehicle exceeds a predefined
threshold.
Storm
Whereas Hadoop relies on batch processing, Storm is a real-time, distributed,
fault-tolerant computation system. Like Hadoop, it can process huge amounts
of data, but does so in real time and with guaranteed reliability; that is,
every message will be processed. Storm also offers features such as fault
tolerance and distributed computation, which make it suitable for processing
huge amounts of data on different machines. It has these features as well:
• It has simple scalability. To scale, you simply add machines and
change parallelism settings of the topology. Storm's usage of
Hadoop's Zookeeper for cluster coordination makes it scalable
for large cluster sizes.
• It guarantees processing of every message.
• Storm clusters are easy to manage.
• Storm is fault tolerant: Once a topology is submitted, Storm runs the
topology until it is killed or the cluster is shut down. Also, if there are
faults during execution, reassignment of tasks is handled by Storm.
• Topologies in Storm can be defined in any language, although
typically Java is used.
To follow the rest of the article, you first need to install and set up
Storm.The steps are straightforward:
• Download the Storm archive from the official Storm website
(http://storm-project.net/downloads.html).
• Unpack the archive, add its bin/ directory to your PATH, and make
sure the bin/storm script is executable.
Storm Components
A Storm cluster mainly consists of a master node and worker nodes, with
coordination done by Zookeeper.
• Master Node: The master node runs a daemon,Nimbus,which is
responsible for distributing the code around the cluster,assign-
ing the tasks, and monitoring failures. It is similar to the Job
Tracker in Hadoop.
• Worker Node:The worker node runs a daemon,Supervisor,which
listens to the work assigned and runs the worker process based
on requirements.Each worker node executes a subset of a topol-
ogy.The coordination between Nimbus and several supervisors
is managed by a Zookeeper system or cluster.
Zookeeper
Zookeeper is responsible for maintaining the coordination service be-
tween the supervisor and master.The logic for a real-time application
is packaged into a Storm “topology.” A topology consists of a graph of
spouts (data sources) and bolts (data operations) that are connected
with stream groupings (coordination). Let’s look at these terms in
greater depth.
• Spout: In simple terms, a spout reads the data from a source for
use in the topology.A spout can either be reliable or unreliable.
A reliable spout makes sure to resend a tuple (which is an or-
dered list of data items) if Storm fails to process it.An unreliable
spout does not track the tuple once it’s emitted. The main
method in a spout is nextTuple().This method either emits a
new tuple to the topology or it returns if there is nothing to emit.
• Bolt: A bolt is responsible for all the processing that happens in a
topology. Bolts can do anything from filtering to joins, aggrega-
tions, talking to files/databases, and so on. Bolts receive the data
from a spout for processing,which may further emit tuples to an-
other bolt in case of complex stream transformations.The main
method in a bolt is execute(), which accepts a tuple as input. In
both the spout and bolt,to emit the tuple to more than one stream,
the streams can be declared and specified in declareStream().
• Stream Groupings: A stream grouping defines how a stream
should be partitioned among the bolt’s tasks.There are built-in
stream groupings (http://is.gd/eJvL0f) provided by Storm: shuffle
grouping, fields grouping, all grouping, global grouping, none
grouping, direct grouping, and local/shuffle grouping. A custom
grouping can also be added by implementing the CustomStreamGrouping
interface. A short sketch contrasting the two most commonly used
groupings follows.
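For instance, Listing Eight later in this article wires its bolts with shuffle grouping only. The short sketch below is not taken from the article; it reuses the article's ParallelFileSpout and ThresholdBolt classes (assumed to be on the classpath) to contrast shuffle grouping with fields grouping on the vehicle_number field that the spout declares, with an illustrative parallelism of four:

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingSketch
{
    public static void main(String[] args)
    {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new ParallelFileSpout(), 1);

        // Shuffle grouping: tuples are distributed randomly and evenly
        // across the four ThresholdBolt tasks.
        builder.setBolt("thresholdBolt", new ThresholdBolt(), 4)
            .shuffleGrouping("spout");

        // Fields grouping: every tuple with the same vehicle_number value
        // goes to the same task, so per-vehicle state stays in one place.
        builder.setBolt("perVehicleBolt", new ThresholdBolt(), 4)
            .fieldsGrouping("spout", new Fields("vehicle_number"));

        // Submission to a LocalCluster or via StormSubmitter is omitted
        // here; Listing Eight shows both paths.
    }
}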
Implementation
For our use case, we designed a topology of spouts and bolts that can
process a huge amount of data (log files) and trigger an alarm when a
specific value crosses a predefined threshold. Using a Storm topology,
the log file is read line by line and the topology monitors the incoming
data. In terms of Storm components, the spout reads the incoming data:
it not only reads the data from existing files, but also monitors for
new files. As soon as a file is modified, the spout reads the new entry
and, after converting it to tuples (a format that a bolt can read),
emits the tuples to the bolt to perform threshold analysis, which finds
any record that has exceeded the threshold.
The next section explains the use case in detail.
Threshold Analysis
In this article,we will be mainly concentrating on two types of thresh-
old analysis:instant threshold and time series threshold.
• Instant threshold checks if the value of a field has exceeded the
threshold value at that instant and raises a trigger if the condi-
tion is satisfied. For example, it raises a trigger if the speed of a
vehicle exceeds 80 km/h.
• Time series threshold checks whether the value of a field has
exceeded the threshold value within a given time window and raises a
trigger if so. For example, it raises a trigger if the speed of a
vehicle exceeds 80 km/h more than once in the last five minutes. (A
minimal sketch of both checks follows.)
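To make the distinction concrete, the following minimal sketch (not the article's ThresholdCalculatorBolt) boils both checks down to a few lines; the 80 km/h limit and the five-minute window come from the examples above, while the class, method, and variable names are purely illustrative:

import java.util.ArrayDeque;
import java.util.Deque;

public class ThresholdChecks
{
    static final double SPEED_LIMIT = 80.0;         // km/h
    static final long WINDOW_MS = 5 * 60 * 1000L;   // five minutes
    static final Deque<Long> violations = new ArrayDeque<Long>();

    // Instant threshold: trigger as soon as a single reading exceeds the limit.
    static boolean instantThreshold(double speed)
    {
        return speed > SPEED_LIMIT;
    }

    // Time series threshold: trigger once the limit has been exceeded more
    // than once inside the sliding five-minute window.
    static boolean timeSeriesThreshold(double speed, long nowMs)
    {
        if (speed > SPEED_LIMIT)
            violations.addLast(nowMs);
        // Discard violations that have fallen out of the window.
        while (!violations.isEmpty() && nowMs - violations.peekFirst() > WINDOW_MS)
            violations.removeFirst();
        return violations.size() > 1;
    }
}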
Listing One shows a log file of the type we’ll use,which contains ve-
hicle data information such as vehicle number,speed at which the ve-
hicle is traveling,and location in which the information is captured.
Listing One: A log file with entries of vehicles passing
through the checkpoint.
AB 123, 60, North city
BC 123, 70, South city
CD 234, 40, South city
DE 123, 40, East city
EF 123, 90, South city
GH 123, 50, West city
A corresponding XML file is created, which consists of the schema
for the incoming data. It is used for parsing the log file. The schema
XML and its corresponding description are shown in Table 1.
Table 1.
The XML file and the log file are in a directory that is monitored by
the spout constantly for real-time changes. The topology we use for
this example is shown in Figure 1.
As shown in Figure 1, the FileListenerSpout accepts the input log
file, reads the data line by line, and emits the data to the
ThresholdCalculatorBolt for further threshold processing. Once the
processing is done, the contents of the line for which the threshold
was calculated are emitted to the DBWriterBolt, where they are
persisted in the database (or an alert is raised). The detailed
implementation for this process is explained next.
Spout Implementation
Spout takes a log file and the XML descriptor file as the input.The XML
file consists of the schema corresponding to the log file.Let us consider
an example log file,which has vehicle data information such as vehicle
number,speed at which the vehicle is travelling,and location in which
the information is captured.(See Figure 2.)
Listing Two shows the specific XML file for a tuple, which specifies
the fields and the delimiter separating the fields in a log file. Both the
XML file and the data are kept in a directory whose path is specified
in the spout.
Listing Two: An XML file created for describing the log
file.
<TUPLEINFO>
<FIELDLIST>
<FIELD>
<COLUMNNAME>vehicle_number</COLUMNNAME>
<COLUMNTYPE>string</COLUMNTYPE>
</FIELD>
<FIELD>
<COLUMNNAME>speed</COLUMNNAME>
<COLUMNTYPE>int</COLUMNTYPE>
</FIELD>
<FIELD>
<COLUMNNAME>location</COLUMNNAME>
<COLUMNTYPE>string</COLUMNTYPE>
</FIELD>
</FIELDLIST>
<DELIMITER>,</DELIMITER>
</TUPLEINFO>
Figure 1:Topology created in Storm to process real-time data.
Figure 2:Flow of data from log files to Spout.
An instance of the spout is initialized with constructor parameters
Directory, Path, and a TupleInfo object. The TupleInfo object stores
necessary information related to the log file, such as its fields,
delimiter, and field types. This object is created by deserializing
the descriptor XML file using XStream (http://xstream.codehaus.org/).
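As a rough illustration, that deserialization might look like the sketch below. The alias names mirror the tags in Listing Two, but the field names assumed inside TupleInfo and Field (fieldList, delimiter, columnName, columnType) are guesses for illustration, not taken from the article's source:

import com.thoughtworks.xstream.XStream;
import java.io.FileReader;

public class TupleInfoLoader
{
    public static TupleInfo load(String xmlPath) throws Exception
    {
        XStream xstream = new XStream();
        // Map the XML element names from Listing Two onto the Java classes.
        xstream.alias("TUPLEINFO", TupleInfo.class);
        xstream.alias("FIELD", Field.class);
        xstream.aliasField("FIELDLIST", TupleInfo.class, "fieldList");
        xstream.aliasField("DELIMITER", TupleInfo.class, "delimiter");
        xstream.aliasField("COLUMNNAME", Field.class, "columnName");
        xstream.aliasField("COLUMNTYPE", Field.class, "columnType");
        FileReader reader = new FileReader(xmlPath);
        try
        {
            return (TupleInfo) xstream.fromXML(reader);
        }
        finally
        {
            reader.close();
        }
    }
}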
Spout implementation steps are:
• Listen to changes on individual log files. Monitor the directory
for the addition of new log files.
• Convert rows read by the spout to tuples after declaring fields
for them.
• Declare the grouping between spout and bolt,deciding the way
in which tuples are given to bolt.
The code for spout is shown in Listing Three.
Listing Three: Logic in the open(), nextTuple(), and declareOutputFields()
methods of the spout.
public void open( Map conf, TopologyContext
context,SpoutOutputCollector collector )
{
_collector = collector;
try
{
fileReader =
new BufferedReader(new FileReader(new File(file)));
}
catch (FileNotFoundException e)
{
System.exit(1);
}
}
public void nextTuple()
{
    ListenFile(file);
}

protected void ListenFile(File file)
{
    Utils.sleep(2000);
    RandomAccessFile access = null;
    String line = null;
    try
    {
        access = new RandomAccessFile(file, "r");
        while ((line = access.readLine()) != null)
        {
            // Pattern.quote() escapes regex metacharacters (such as "|")
            // so the delimiter from the XML descriptor is split on literally.
            String[] fields = line.split(
                java.util.regex.Pattern.quote(tupleInfo.getDelimiter()));
            if (tupleInfo.getFieldList().size() == fields.length)
                _collector.emit(new Values(fields));
        }
    }
    catch (IOException ex) { }
    finally
    {
        if (access != null)
            try { access.close(); } catch (IOException ex) { }
    }
}
public void declareOutputFields(OutputFieldsDeclarer declarer)
{
String[] fieldsArr =
new String [tupleInfo.getFieldList().size()];
for(int i=0; i<tupleInfo.getFieldList().size(); i++)
{
fieldsArr[i] =
tupleInfo.getFieldList().get(i).getColumnName();
}
declarer.declare(new Fields(fieldsArr));
}
declareOutputFields() decides the format in which the tuple is
emitted, so that the bolt can decode the tuple in a similar fashion.
The spout keeps listening for data added to the log file; as soon as new
data arrives, it reads it and emits it to the bolt for processing. A
minimal sketch of one way to keep track of how much of the file has
already been emitted follows.
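Because nextTuple() is called over and over, a production spout also has to remember how far into the file it has already read; otherwise it will re-emit the same lines on every pass. The sketch below shows one way to do that; it is not the article's code. The file, tupleInfo, and _collector fields are the ones used in Listing Three, while lastOffset is an assumed extra field and the 100 ms sleep is an arbitrary back-off:

private long lastOffset = 0;   // assumed field: bytes of the file already emitted

public void nextTuple()
{
    RandomAccessFile access = null;
    try
    {
        access = new RandomAccessFile(file, "r");
        access.seek(lastOffset);                  // skip lines emitted earlier
        String line;
        while ((line = access.readLine()) != null)
        {
            String[] fields = line.split(
                java.util.regex.Pattern.quote(tupleInfo.getDelimiter()));
            if (fields.length == tupleInfo.getFieldList().size())
                _collector.emit(new Values(fields));
        }
        lastOffset = access.getFilePointer();     // resume here on the next call
    }
    catch (IOException e)
    {
        // A real spout would log this and back off instead of ignoring it.
    }
    finally
    {
        if (access != null)
            try { access.close(); } catch (IOException e) { }
    }
    Utils.sleep(100);                             // avoid a busy loop when idle
}

If the offset were persisted somewhere durable, the spout could also resume where it left off after a restart.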
Bolt Implementation
The output of the spout is given to the bolt for further processing.
The topology we have considered for our use case consists of two bolts,
as shown in Figure 3.
Figure 3: Flow of data from Spout to Bolt.
ThresholdCalculatorBolt
The tuples emitted by the spout are received by the ThresholdCalculatorBolt
for threshold processing. It accepts several inputs for the threshold
check:
• Threshold value to check
• Threshold column number to check
• Threshold column data type
• Threshold check operator
• Threshold frequency of occurrence
• Threshold time window
A class, shown in Listing Four, is defined to hold these values.
Listing Four: ThresholdInfo class.
public class ThresholdInfo implements Serializable
{
private String action;
private String rule;
private Object thresholdValue;
private int thresholdColNumber;
private Integer timeWindow;
private int frequencyOfOccurence;
}
Based on the values provided in fields, the threshold check is made
in the execute() method as shown in Listing Five. The code mostly
consists of parsing and checking the incoming values.
Listing Five: Code for threshold check.
public void execute(Tuple tuple, BasicOutputCollector collector)
{
if(tuple!=null)
{
List<Object> inputTupleList =
(List<Object>) tuple.getValues();
int thresholdColNum =
thresholdInfo.getThresholdColNumber();
Object thresholdValue = thresholdInfo.getThresholdValue();
String thresholdDataType =
tupleInfo.getFieldList().get(thresholdColNum-1)
.getColumnType();
Integer timeWindow = thresholdInfo.getTimeWindow();
int frequency = thresholdInfo.getFrequencyOfOccurence();
if(thresholdDataType.equalsIgnoreCase("string"))
{
String valueToCheck =
inputTupleList.get(thresholdColNum-1).toString();
String frequencyChkOp = thresholdInfo.getAction();
if(timeWindow!=null)
{
long curTime = System.currentTimeMillis();
long diffInMinutes = (curTime - startTime) / (60 * 1000); // ms to minutes
if(diffInMinutes>=timeWindow)
{
if(frequencyChkOp.equals("=="))
{
if(valueToCheck.equalsIgnoreCase
(thresholdValue.toString()))
{
count.incrementAndGet();
if(count.get() > frequency)
splitAndEmit
(inputTupleList,collector);
}
}
else if(frequencyChkOp.equals("!="))
{
if(!valueToCheck.equalsIgnoreCase
(thresholdValue.toString()))
{
count.incrementAndGet();
if(count.get() > frequency)
splitAndEmit(inputTupleList,
collector);
}
}
else
    System.out.println("Operator not supported");
}
}
else
{
if(frequencyChkOp.equals("=="))
{
if(valueToCheck.equalsIgnoreCase
(thresholdValue.toString()))
{
count.incrementAndGet();
if(count.get() > frequency)
splitAndEmit(inputTupleList,collector);
}
}
else if(frequencyChkOp.equals("!="))
{
if(!valueToCheck.equalsIgnoreCase
(thresholdValue.toString()))
{
count.incrementAndGet();
if(count.get() > frequency)
splitAndEmit(inputTupleList,collector);
}
}
}
}
else if(thresholdDataType.equalsIgnoreCase("int") ||
    thresholdDataType.equalsIgnoreCase("double") ||
    thresholdDataType.equalsIgnoreCase("float") ||
    thresholdDataType.equalsIgnoreCase("long") ||
    thresholdDataType.equalsIgnoreCase("short"))
{
String frequencyChkOp = thresholdInfo.getAction();
if(timeWindow!=null)
{
long valueToCheck =
Long.parseLong(inputTupleList.
get(thresholdColNum-1).toString());
long curTime = System.currentTimeMillis();
long diffInMinutes =
    (curTime - startTime) / (60 * 1000); // ms to minutes
System.out.println("Difference in minutes = " + diffInMinutes);
if(diffInMinutes>=timeWindow)
{
if(frequencyChkOp.equals("<"))
{
if(valueToCheck < Double.parseDouble
(thresholdValue.toString()))
{
count.incrementAndGet();
if(count.get() > frequency)
splitAndEmit(inputTupleList,collector);
}
}
else if(frequencyChkOp.equals(">"))
{
if(valueToCheck > Double.parseDouble
(thresholdValue.toString()))
{
count.incrementAndGet();
if(count.get() > frequency)
splitAndEmit(inputTupleList,collector);
}
}
else if(frequencyChkOp.equals("=="))
{
if(valueToCheck ==
Double.parseDouble
(thresholdValue.toString()))
{
count.incrementAndGet();
if(count.get() > frequency)
splitAndEmit(inputTupleList,collector);
}
}
else if(frequencyChkOp.equals("!="))
{
. . .
}
}
}
else
splitAndEmit(null,collector);
        }
    }
    else
    {
        System.err.println("Emitting null in bolt");
        splitAndEmit(null, collector);
    }
}
The tuples emitted by the threshold bolt are passed to the next bolt,
which is DBWriterBolt in our case.
DBWriterBolt
The processed tuple has to be persisted for raising a trigger or for
further use. DBWriterBolt does the job of persisting the tuples into the
database. The creation of the table is done in prepare(), which is the
first method invoked when the topology starts. The code for this method
is given in Listing Six.
Listing Six: Code for creation of tables.
public void prepare( Map StormConf, TopologyContext context )
{
try
{
Class.forName(dbClass);
}
catch (ClassNotFoundException e)
{
System.out.println("Driver not found");
e.printStackTrace();
}
try
{
connection = DriverManager.getConnection(
    "jdbc:mysql://" + databaseIP + ":" + databasePort
    + "/" + databaseName, userName, pwd);
connection.prepareStatement
    ("DROP TABLE IF EXISTS " + tableName).execute();
StringBuilder createQuery = new StringBuilder(
    "CREATE TABLE IF NOT EXISTS " + tableName + "(");
for(Field fields : tupleInfo.getFieldList())
{
    if(fields.getColumnType().equalsIgnoreCase("String"))
        createQuery.append(fields.getColumnName()
            + " VARCHAR(500),");
    else
        createQuery.append(fields.getColumnName()
            + " " + fields.getColumnType() + ",");
}
createQuery.append("thresholdTimeStamp timestamp)");
connection.prepareStatement(createQuery.toString()).execute();
// Insert Query
StringBuilder insertQuery = new StringBuilder("INSERT INTO "
    + tableName + "(");
for(Field fields : tupleInfo.getFieldList())
{
    insertQuery.append(fields.getColumnName() + ",");
}
insertQuery.append("thresholdTimeStamp").append(") values (");
for(Field fields : tupleInfo.getFieldList())
{
    insertQuery.append("?,");
}
insertQuery.append("?)");
prepStatement =
connection.prepareStatement(insertQuery.toString());
}
catch (SQLException e)
{
e.printStackTrace();
}
}
Insertion of data is done in batches. The logic for insertion is provided
in execute(), as shown in Listing Seven, and consists mostly of parsing
the various possible input types.
Listing Seven: Code for insertion of data.
public void execute(Tuple tuple, BasicOutputCollector collector)
{
batchExecuted=false;
if(tuple!=null)
{
List<Object> inputTupleList = (List<Object>) tuple.getValues();
int dbIndex=0;
for(int i=0;i<tupleInfo.getFieldList().size();i++)
{
Field field = tupleInfo.getFieldList().get(i);
try {
dbIndex = i+1;
if(field.getColumnType().equalsIgnoreCase("String"))
    prepStatement.setString(dbIndex,
        inputTupleList.get(i).toString());
else if(field.getColumnType().equalsIgnoreCase("int"))
    prepStatement.setInt(dbIndex,
        Integer.parseInt(inputTupleList.get(i).toString()));
else if(field.getColumnType().equalsIgnoreCase("long"))
    prepStatement.setLong(dbIndex,
        Long.parseLong(inputTupleList.get(i).toString()));
else if(field.getColumnType().equalsIgnoreCase("float"))
    prepStatement.setFloat(dbIndex,
        Float.parseFloat(inputTupleList.get(i).toString()));
else if(field.getColumnType().equalsIgnoreCase("double"))
    prepStatement.setDouble(dbIndex,
        Double.parseDouble(inputTupleList.get(i).toString()));
else if(field.getColumnType().equalsIgnoreCase("short"))
    prepStatement.setShort(dbIndex,
        Short.parseShort(inputTupleList.get(i).toString()));
else if(field.getColumnType().equalsIgnoreCase("boolean"))
    prepStatement.setBoolean(dbIndex,
        Boolean.parseBoolean(inputTupleList.get(i).toString()));
else if(field.getColumnType().equalsIgnoreCase("byte"))
    prepStatement.setByte(dbIndex,
        Byte.parseByte(inputTupleList.get(i).toString()));
else if(field.getColumnType().equalsIgnoreCase("Date"))
{
    Date dateToAdd = null;
    if (!(inputTupleList.get(i) instanceof Date))
    {
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
        try
        {
            dateToAdd = df.parse(inputTupleList.get(i).toString());
        }
        catch (ParseException e)
        {
            System.err.println("Data type not valid");
        }
    }
    else
    {
        dateToAdd = (Date) inputTupleList.get(i);
    }
    if (dateToAdd != null)
    {
        java.sql.Date sqlDate = new java.sql.Date(dateToAdd.getTime());
        prepStatement.setDate(dbIndex, sqlDate);
    }
}
}   // closes the try opened at the top of the loop
catch (SQLException e)
{
    e.printStackTrace();
}
}
Date now = new Date();
try
{
prepStatement.setTimestamp(dbIndex+1,
new java.sql.Timestamp(now.getTime()));
prepStatement.addBatch();
counter.incrementAndGet();
if (counter.get()== batchSize)
executeBatch();
}
catch (SQLException e1)
{
e1.printStackTrace();
}
}
else
{
long curTime = System.currentTimeMillis();
long diffInSeconds = (curTime - startTime) / 1000;  // ms to seconds
if(counter.get() <
batchSize && diffInSeconds>batchTimeWindowInSeconds)
{
try {
executeBatch();
startTime = System.currentTimeMillis();
}
catch (SQLException e) {
e.printStackTrace();
}
}
}
}
public void executeBatch() throws SQLException
{
batchExecuted=true;
prepStatement.executeBatch();
counter = new AtomicInteger(0);
}
Once the spout and bolts are ready to be executed, a topology is built
by the topology builder and executed. The next section explains the
execution steps.
Running and Testing the Topology in a Local Cluster
Define the topology using TopologyBuilder, which exposes the Java
API for specifying a topology for Storm to execute. Then:
• Submit the topology to the cluster using StormSubmitter, which takes
the name of the topology, the configuration, and the topology itself
as input.
• Alternatively, for testing, submit the topology to a LocalCluster,
as Listing Eight does when no command-line arguments are given.
Listing Eight: Building and executing a topology.
public class StormMain
{
    public static void main(String[] args)
        throws AlreadyAliveException,
               InvalidTopologyException,
               InterruptedException
    {
        ParallelFileSpout parallelFileSpout = new ParallelFileSpout();
        ThresholdBolt thresholdBolt = new ThresholdBolt();
        DBWriterBolt dbWriterBolt = new DBWriterBolt();
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", parallelFileSpout, 1);
        builder.setBolt("thresholdBolt", thresholdBolt, 1).
            shuffleGrouping("spout");
        builder.setBolt("dbWriterBolt", dbWriterBolt, 1).
            shuffleGrouping("thresholdBolt");
        // Create the configuration up front so both branches can use it.
        Config conf = new Config();
        if(args != null && args.length > 0)
        {
            // Submit to a real cluster; args[0] is the topology name.
            conf.setNumWorkers(1);
            StormSubmitter.submitTopology(
                args[0], conf, builder.createTopology());
        }
        else
        {
            // No arguments: run in a local in-process cluster for testing.
            conf.setDebug(true);
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology(
                "Threshold_Test", conf, builder.createTopology());
        }
    }
}
After building the topology, it is submitted to the local cluster. Once
submitted, the topology runs without requiring any modifications until
it is explicitly killed or the cluster is shut down. This is another big
advantage of Storm.
This comparatively simple example shows the ease with which it’s
possible to set up and use Storm once you understand the basic con-
cepts of topology, spout, and bolt. The code is straightforward and
both scalability and speed are provided by Storm. So, if you’re look-
ing to handle big data and don’t want to traverse the Hadoop uni-
verse, you might well find that using Storm is a simple and elegant
solution.
— Shruthi Kumar works as a technology analyst and Siddharth Patankar is a software engineer with the Cloud Center of Excellence at Infosys Labs.
This Month on DrDobbs.com
Items of special interest posted on www.drdobbs.com over the past
month that you may have missed.
IF JAVA IS DYING,
IT SURE LOOKS AWFULLY HEALTHY
The odd, but popular, assertion that Java is dying can be made only
in spite of the evidence, not because of it.
http://www.drdobbs.com/240162390
CONTINUOUS DELIVERY: THE FIRST STEPS
Continuous delivery integrates many practices that in their totality
might seem daunting. But starting with a few basic steps brings im-
mediate benefits.Here’s how.
http://www.drdobbs.com/240161356
A SIMPLE, IMMUTABLE,
NODE-BASED DATA STRUCTURE
Array-like data structures aren’t terribly useful in a world that doesn’t
allow data to change because it’s hard to implement even such simple
operations as appending to an array efficiently.The difficulty is that in
an environment with immutable data, you can’t just append a value
to an array; you have to create a new array that contains the old array
along with the value that you want to append.
http://www.drdobbs.com/240162122
DIJKSTRA’S 3 RULES FOR PROJECT SELECTION
Want to start a unique and truly useful open-source project? These
three guidelines on choosing wisely will get you there.
http://www.drdobbs.com/240161615
PRIMITIVE VERILOG
Verilog is decidedly schizophrenic. There is part of the Verilog lan-
guage that synthesizers can commonly convert into FPGA logic, and
then there is an entire part of the language that doesn’t synthesize.
http://www.drdobbs.com/240162355
DEVELOPING ANDROID APPS WITH SCALA AND
SCALOID: PART 2
Starting with templates, Android features can be added quickly with
a single line of DSL code.
http://www.drdobbs.com/240162204
FIDGETY USB
Linux-based boards like the Raspberry Pi or the Beagle Bone usually
have some general-purpose I/O capability,but it is easy to forget they
also sport USB ports.
http://www.drdobbs.com/240162050
INFORMATIONWEEK
Rob Preston VP and Editor In Chief, InformationWeek
rob.preston@ubm.com 516-562-5692
Chris Murphy Editor,InformationWeek
chris.murphy@ubm.com 414-906-5331
Lorna Garey Content Director, Reports, InformationWeek
lorna.garey@ubm.com 978-694-1681
Brian Gillooly,VP and Editor In Chief,Events
brian.gillooly@ubm.com
INFORMATIONWEEK.COM
Laurianne McLaughlin Editor
laurianne.mclaughlin@ubm.com 516-562-5336
Roma Nowak Senior Director,
Online Operations and Production
roma.nowak@ubm.com 516-562-5274
Joy Culbertson Web Producer
joy.culbertson@ubm.com
Atif Malik Director,
Web Development
atif.malik@ubm.com
MEDIA KITS
http://createyournextcustomer.techweb.com/media-kit/business-technology-audience-media-kit/
UBM TECH
AUDIENCE DEVELOPMENT
Director,Karen McAleer
(516) 562-7833, karen.mcaleer@ubm.com
SALES CONTACTS—WEST
Western U.S.(Pacific and Mountain states)
and Western Canada (British Columbia,
Alberta)
Sales Director,Michele Hurabiell
(415) 378-3540,michele.hurabiell@ubm.com
Strategic Accounts
Account Director,Sandra Kupiec
(415) 947-6922,sandra.kupiec@ubm.com
Account Manager,Vesna Beso
(415) 947-6104, vesna.beso@ubm.com
Account Executive,Matthew Cohen-Meyer
(415) 947-6214, matthew.meyer@ubm.com
MARKETING
VP,Marketing,Winnie Ng-Schuchman
(631) 406-6507, winnie.ng@ubm.com
Marketing Director,Angela Lee-Moll
(516) 562-5803,angele.leemoll@ubm.com
Marketing Manager,Monique Luttrell
(949) 223-3609, monique.luttrell@ubm.com
Program Manager,Nicole Schwartz
516-562-7684,nicole.schwartz@ubm.com
SALES CONTACTS—EAST
Midwest,South,Northeast U.S.and Eastern
Canada (Saskatchewan,Ontario,Quebec,New
Brunswick)
District Manager,Steven Sorhaindo
(212) 600-3092,steven.sorhaindo@ubm.com
Strategic Accounts
District Manager,Mary Hyland
(516) 562-5120, mary.hyland@ubm.com
Account Manager,Tara Bradeen
(212) 600-3387, tara.bradeen@ubm.com
Account Manager,Jennifer Gambino
(516) 562-5651, jennifer.gambino@ubm.com
Account Manager,Elyse Cowen
(212) 600-3051, elyse.cowen@ubm.com
Sales Assistant, Kathleen Jurina
(212) 600-3170, kathleen.jurina@ubm.com
BUSINESS OFFICE
General Manager,
Marian Dujmovits
United Business Media LLC
600 Community Drive
Manhasset,N.Y.11030
(516) 562-5000
Copyright 2013.
All rights reserved.
November 2013 35www.drdobbs.com
UBM TECH
Paul Miller,CEO
Robert Faletra,CEO,Channel
Kelley Damore,Chief Community Officer
Marco Pardi,President,Business
Technology Events
Adrian Barrick,Chief Content Officer
David Michael,Chief Information Officer
Sandra Wallach CFO
Simon Carless,EVP,Game & App
Development and Black Hat
Lenny Heymann,EVP,New Markets
Angela Scalpello,SVP,People & Culture
Andy Crow,Interim Chief of Staff
UNITED BUSINESS MEDIA LLC
Pat Nohilly Sr.VP,Strategic Development
and Business Administration
Marie Myers Sr.VP,
Manufacturing
UBM TECH ONLINE COMMUNITIES
Bank Systems & Tech
Dark Reading
DataSheets.com
Designlines
Dr.Dobb’s
EBN
EDN
EE Times
EE Times University
Embedded
Gamasutra
GAO
Heavy Reading
InformationWeek
IW Education
IW Government
IW Healthcare
Insurance & Technology
Light Reading
Network Computing
Planet Analog
Pyramid Research
TechOnline
Wall Street & Tech
UBM TECH EVENT COMMUNITIES
4G World
App Developers Conference
ARM TechCon
Big Data Conference
Black Hat
Cloud Connect
DESIGN
DesignCon
E2
Enterprise Connect
ESC
Ethernet Expo
GDC
GDC China
GDC Europe
GDC Next
GTEC
HDI Conference
Independent Games Festival
Interop
Mobile Commerce World
Online Marketing Summit
Telco Vision
Tower & Cell Summit
http://createyournextcustomer.techweb.com
Andrew Binstock Editor in Chief, Dr. Dobb’s
andrew.binstock@ubm.com
Deirdre Blake Managing Editor, Dr. Dobb’s
deirdre.blake@ubm.com
Amy Stephens Copyeditor, Dr. Dobb’s
amy.stephens@ubm.com
Jon Erickson Editor in Chief Emeritus, Dr. Dobb’s
CONTRIBUTING EDITORS
Scott Ambler
Mike Riley
Herb Sutter
DR. DOBB’S EDITORIAL
751 Laurel Street #614
San Carlos, CA 94070 USA
UBM TECH
303 Second Street, Suite 900, South Tower
San Francisco, CA 94107
1-415-947-6000
Entire contents Copyright © 2013, UBM Tech/United Business Media LLC,
except where otherwise noted. No portion of this publication may be
reproduced, stored, or transmitted in any form, including computer
retrieval, without written permission from the publisher. All Rights
Reserved. Articles express the opinion of the author and are not
necessarily the opinion of the publisher. Published by UBM Tech/United
Business Media, 303 Second Street, Suite 900 South Tower, San Francisco,
CA 94107 USA, 415-947-6000.
More Related Content

Viewers also liked

Guinness book of world records
Guinness book of world recordsGuinness book of world records
Guinness book of world records
zachstrock21
 
2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...
2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...
2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...
Komandur Sunder Raj, P.E.
 
2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...
2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...
2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...
Komandur Sunder Raj, P.E.
 
2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...
2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...
2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...
Komandur Sunder Raj, P.E.
 
2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...
2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...
2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...
Komandur Sunder Raj, P.E.
 
2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...
2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...
2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...
Komandur Sunder Raj, P.E.
 
2005 ASME Power Conference Performance Considerations in Replacement of Low P...
2005 ASME Power Conference Performance Considerations in Replacement of Low P...2005 ASME Power Conference Performance Considerations in Replacement of Low P...
2005 ASME Power Conference Performance Considerations in Replacement of Low P...
Komandur Sunder Raj, P.E.
 
2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation
2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation
2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation
Komandur Sunder Raj, P.E.
 
2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...
2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...
2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...
Komandur Sunder Raj, P.E.
 
2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...
2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...
2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...
Komandur Sunder Raj, P.E.
 
2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...
2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...
2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...
Komandur Sunder Raj, P.E.
 
2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...
2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...
2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...
Komandur Sunder Raj, P.E.
 
2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...
2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...
2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...
Komandur Sunder Raj, P.E.
 

Viewers also liked (15)

Guinness book of world records
Guinness book of world recordsGuinness book of world records
Guinness book of world records
 
Guinness book of world records
Guinness book of world recordsGuinness book of world records
Guinness book of world records
 
2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...
2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...
2005 ASME Power Conference Lessons Learned - Low Condenser Vacuum/High Dissol...
 
Quit Hitting Yourself
Quit Hitting YourselfQuit Hitting Yourself
Quit Hitting Yourself
 
2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...
2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...
2005 ASME Power Conference Performance Considerations in Power Uprates of Nuc...
 
2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...
2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...
2007 ASME Power Conference Maximizing Value of Existing Nuclear Power Plant G...
 
2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...
2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...
2004 ASME Power Conference Capacity Losses in Nuclear Plants - A Case Study S...
 
2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...
2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...
2013 ASME Power Conference Maximizing Power Generating Asset Value Sunder Raj...
 
2005 ASME Power Conference Performance Considerations in Replacement of Low P...
2005 ASME Power Conference Performance Considerations in Replacement of Low P...2005 ASME Power Conference Performance Considerations in Replacement of Low P...
2005 ASME Power Conference Performance Considerations in Replacement of Low P...
 
2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation
2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation
2015 Lehigh Valley Engineers Week Keynote Speech Sunder Raj Presentation
 
2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...
2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...
2014 ASME Power Conference Performance/Condition Monitoring & Optimization fo...
 
2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...
2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...
2003 ASME Power Conference Heat Balance Techniques for Diagnosing and Evaluat...
 
2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...
2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...
2005 ASME Power Conference Analysis of Turbine Cycle Performance Losses Using...
 
2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...
2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...
2014 EPRI Conference Impact of Developments in HEI Correction Factors on Cond...
 
2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...
2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...
2012 ICONE20 Power Conference Developing Nuclear Power Plant TPMS Specificati...
 

Similar to DDJ_102113

NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...
NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...
NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...
Capgemini
 
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
marksimpsongw
 
CS828 P5 Individual Project v101
CS828 P5 Individual Project v101CS828 P5 Individual Project v101
CS828 P5 Individual Project v101
ThienSi Le
 
Sql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White Paper
Sql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White PaperSql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White Paper
Sql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White Paper
David J Rosenthal
 

Similar to DDJ_102113 (20)

On nosql
On nosqlOn nosql
On nosql
 
NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013
 
NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...
NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...
NoSQL and Hadoop: A New Generation of Databases - Changing the Game: Monthly ...
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
NoSQL
NoSQLNoSQL
NoSQL
 
AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...
AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...
AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...
 
SQL vs NoSQL deep dive
SQL vs NoSQL deep diveSQL vs NoSQL deep dive
SQL vs NoSQL deep dive
 
NoSQL
NoSQLNoSQL
NoSQL
 
The NoSQL Movement
The NoSQL MovementThe NoSQL Movement
The NoSQL Movement
 
Graph databases and OrientDB
Graph databases and OrientDBGraph databases and OrientDB
Graph databases and OrientDB
 
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
Mark Simpson - UKOUG23 - Refactoring Monolithic Oracle Database Applications ...
 
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
 
CS828 P5 Individual Project v101
CS828 P5 Individual Project v101CS828 P5 Individual Project v101
CS828 P5 Individual Project v101
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technology
 
No sql database
No sql databaseNo sql database
No sql database
 
Sql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White Paper
Sql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White PaperSql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White Paper
Sql Server 2014 Platform for Hybrid Cloud Technical Decision Maker White Paper
 
Livre blanc Windows Azure No SQL
Livre blanc Windows Azure No SQLLivre blanc Windows Azure No SQL
Livre blanc Windows Azure No SQL
 
Couchbase - Introduction
Couchbase - IntroductionCouchbase - Introduction
Couchbase - Introduction
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
NoSQL Basics and MongDB
NoSQL Basics and  MongDBNoSQL Basics and  MongDB
NoSQL Basics and MongDB
 

DDJ_102113

  • 2. November 2013 2 C O N T E N T S COVER ARTICLE 8 Understanding What Big Data Can Deliver By Aaron Kimball It’s easy to err by pushing data to fit a projected model. Insights come,however,from accepting the data’s ability to depict what is going on,without imposing an a priori bias. GUEST EDITORIAL 3 Do All Roads Lead Back to SQL? By Seth Proctor After distancing themselves from SQL,NoSQL products are mov- ing towards transactional models as “NewSQL” gains popularity. What happened? FEATURES 15 Applying the Big Data Lambda Architecture By Michael Hausenblas A look inside a Hadoop-based project that matches connections in socialmediabyleveragingthehighlyscalablelambdaarchitecture. 23 From the Vault:Easy Real-Time Big Data Analysis Using Storm By Shruthi Kumar and Siddharth Patankar If you're looking to handle big data and don't want to tra- verse the Hadoop universe,you might well find that using Storm is a simple and elegant solution. 6 News Briefs By Adrian Bridgwater Recent news on tools,platforms,frameworks,and the state of the software development world. 7 Open-Source Dashboard A compilation of trending open-source projects. 34 Links Snapshots of interesting items on drdobbs.com including a look at the first steps to implementing Continuous Delivery and developing Android apps with Scala and Scaloid. www.drdobbs.com November 2013 Dr.Dobb’sJournal More on DrDobbs.com JoltAwards:TheBestBooks Five notable books everyserious programmer should read. http://www.drdobbs.com/240162065 A Massively Parallel Stack for Data Allocation Dynamic parallelism is an important evolutionary step in the CUDA software development platform.With it, developers can perform variable amounts of work based on divide-and-conquer algorithms and in-memory data structures such as trees and graphs — entirely on the GPU without host intervention. http://www.drdobbs.com/240162018 Introduction to Programming with Lists What it’s like to program with immutable lists. http://www.drdobbs.com/240162440 WhoAreSoftwareDevelopers? Ten years of surveys show an influx of younger devel- opers, more women, and personality profiles at odds with traditional stereotypes. http://www.drdobbs.com/240162014 Java and IoT In Motion Eric Bruno was involved in the construction of the In- ternet of Things (IoT) concept project called “IoT In Motion.” He helped build some of the back-end com- ponents including a RESTful service written in Java with some database queries,and helped a bit with the front-end as well. http://www.drdobbs.com/240162189
  • 3. www.drdobbs.com uch has been made in the past several years about SQL versus NoSQL and which model is better suited to mod- ern, scale-out deployments. Lost in many of these argu- ments is the raison d’être for SQL and the difference be- tween model and implementation. As new architectures emerge, the question is why SQL endures and why there is such a renewed interest in it today. Background In 1970,Edgar Codd captured his thoughts on relational logic in a pa- per that laid out rules for structuring and querying data (http://is.gd/upAlYi). A decade later, the Structured Query Language (SQL) began to emerge. While not entirely faithful to Codd’s original rules, it provided relational capabilities through a mostly declarative language and helped solve the problem of how to manage growing quantities of data. Over the next 30 years, SQL evolved into the canonical data-man- agement language, thanks largely to the clarity and power of its un- derlying model and transactional guarantees. For much of that time, deployments were dominated by scale-up or “vertical” architectures, in which increased capacity comes from upgrading to bigger,individ- ual systems.Unsurprisingly, this is also the design path that most SQL implementations followed. The term “NoSQL” was coined in 1998 by a database that provided relational logic but eschewed SQL (http://is.gd/sxH0qy).It wasn’t until 2009 that this term took on its current,non-ACID meaning.By then,typ- ical deployments had already shifted to scale-out or “horizontal” mod- els.The perception was that SQL could not provide scale-out capability, and so new non-SQL programming models gained popularity. Fast-forward to 2013 and after a period of decline, SQL is regaining popularity in the form of NewSQL (http://is.gd/x0c5uu) implementa- tions. Arguably, SQL never really lost popularity (the market is esti- mated at $30 billion and growing),it just went out of style.Either way, this new generation of systems is stepping back to look at the last 40 years and understand what that tells us about future design by apply- ing the power of relational logic to the requirements of scale-out de- ployments. Why SQL? SQL evolved as a language because it solved concrete problems. The relational model was built on capturing the flow of real-world data.If a purchase is made,it relates to some customer and product.If a song is [GUEST EDITORIAL] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 3 Do All Roads Lead Back to SQL? After distancing themselves from SQL, NoSQL products are moving towards transactional models as “NewSQL” gains popularity.What happened? By Seth Proctor M
  • 4. www.drdobbs.com played, it relates to an artist, an album, a genre, and so on. By defining these relations,programmers know how to work with data,and the sys- tem knows how to optimize queries. Once these relations are defined, then other uses of the data (audit,governance,etc.) are much easier. Layered on top of this model are transactions. Transactions are boundaries guaranteeing the programmer a consistent view of the database, independent execution relative to other transactions, and clear behavior when two transactions try to make conflicting changes. That’s the A (atomicity),C (consistency),and I (isolation) in ACID.To say a transaction has committed means that these rules were met, and that any changes were made Durable (the D in ACID).Either everything succeeds or nothing is changed. Transactions were introduced as a simplification.They free develop- ers from having to think about concurrent access,locking,or whether their changes are recorded.In this model,a multithreaded service can be programmed as if there were only a single thread. Such program- ming simplification is extremely useful on a single server.When scaling across a distributed environment,it becomes critical. With these features in place,developers building on SQL were able to be more productive and focus on their applications.Of particular impor- tance is consistency.Many NoSQL systems sacrifice consistency for scal- ability, putting the burden back on application developers.This trade- off makes it easier to build a scale-out database,but typically leaves de- velopers choosing between scale and transactional consistency. Why Not SQL? It’s natural to ask why SQL is seen as a mismatch for scale-out archi- tectures, and there are a few key answers. The first is that traditional SQL implementations have trouble scaling horizontally.This has led to approaches like sharding,passive replication,and shared-disk cluster- ing. The limitations (http://is.gd/SaoHcL) are functions of designing around direct disk interaction and limited main memory,however,and not inherent in SQL. A second issue is structure.Many NoSQL systems tout the benefit of having no (or a limited) schema.In practice,developers still need some contract with their data to be effective.It’s flexibility that’s needed — an easy and efficient way to change structure and types as an appli- cation evolves. The common perception is that SQL cannot provide this flexibility, but again, this is a function of implementation. When table structure is tied to on-disk representation, making changes to that structure is very expensive; whereas nothing in Codd’s logic makes adding or renaming a column expensive. Finally,some argue that SQL itself is too complicated a language for today’s programmers. The arguments on both sides are somewhat subjective,but the reality is that SQL is a widely used language with a large community of programmers and a deep base of tools for tasks like authoring,backup,or analysis.Many NewSQL systems are layering simpler languages on top of full SQL support to help bridge the gap between NoSQL and SQL systems.Both have their utility and their uses in modern environments.To many developers,however,being able to [GUEST EDITORIAL ] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 4 “Many NoSQL systems tout the benefit of having no (or a limited) schema.In practice,developers still need some contract with their data to be effective”
  • 5. www.drdobbs.com reuse tools and experience in the context of a scale-out database means not having to compromise on scale versus consistency. Where Are We Heading? The last few years have seen renewed excitement around SQL. NewSQL systems have emerged that support transactional SQL, built on original architectures that address scale-out requirements. These systems are demonstrating that transactions and SQL can scale when built on the right design. Google, for instance, developed F1 (http://is.gd/Z3UDRU) because it viewed SQL as the right way to ad- dress concurrency,consistency,and durability requirements.F1 is spe- cific to the Google infrastructure but is proof that SQL can scale and that the programming model still solves critical problems in today’s data centers. Increasingly, NewSQL systems are showing scale, schema flexibility, and ease of use. Interestingly, many NoSQL and analytic systems are now putting limited transactional support or richer query languages into their roadmaps in a move to fill in the gaps around ACID and de- clarative programming.What that means for the evolution of these sys- tems is yet to be seen, but clearly, the appeal of Codd’s model is as strong as ever 43 years later. — Seth Proctor serves as Chief Technology Officer of NuoDB Inc.and has more than 15yearsofexperienceintheresearch,design,andimplementationofscalablesystems. His previous work includes contributions to the Java security framework,the Solaris operatingsystem,andseveralopen-sourceprojects. [GUEST EDITORIAL ] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 5
  • 6. INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> www.drdobbs.com News Briefs [NEWS] November 2013 6 Progress Pacific PaaS Is A Wider Developer’s PaaS Progress has used its Progress Exchange 2013 exhibition and devel- oper conference to announce new features in the Progress Pacific plat- form-as-a-service (PaaS) that allow more time and energy to be spent solving business problems with data-driven applications and less time worrying about technology and writing code. This is a case of cloud- centric data-driven software application development supporting workflows that are engineered to RealTime Data (RTD) from disparate sources, other SaaS entities, sensors, and points within the Internet of Things — for developers,these workflows must be functional for mo- bile, on premise, and hybrid apps where minimal coding is required such that the programmer is isolated to a degree from the complexity of middleware,APIs,and drivers. http://www.drdobbs.com/240162366 New Java Module In SOASTA CloudTest SOASTA has announced the latest release of CloudTest with a new Java module to enable developers and testers of Java applications to test any Java component as they work to “easily scale” it.Direct-to-database testing here supports Oracle, Microsoft SQL Server, and PostgreSQL databases — and this is important for end-to-end testing for enterprise developers. Also, additional in-memory processing enhancements make dashboard loading faster for in-test analytics.New CloudTest ca- pabilities include Direct-to-Database testing.CloudTest users can now directly test the scalability of the most popular enterprise and open source SQL databases from Oracle, Microsoft SQL Server, and Post- greSQL. http://www.drdobbs.com/240162292 HBase Apps And The 20 Millisecond Factor MapRTechnologies has updated its M7 edition to improve HBase appli- cation performance with throughput that is 4-10x faster while eliminat- ing latency spikes.HBase applications can now benefit from MapR’s plat- form to address one of the major issues for online applications, consistent read latencies in the “less than 20 millisecond” range,as they exist across varying workloads. Differentiated features here include ar- chitecture that persists table structure at the filesystem layer; no com- pactions (I/O storms) for HBase applications; workload-aware splits for HBase applications;direct writes to disk (vs.writing to an external filesys- tem);disk and network compression;and C++ implementation that does not suffer from garbage collection problems seen with Java applications. http://www.drdobbs.com/240162218 Sauce Labs and Microsoft Whip Up BrowserSwarm Sauce Labs and Microsoft have partnered to announce Browser- Swarm,a project to streamline JavaScript testing of Web and mobile apps and decrease the amount of time developers spend on debug- ging application errors. BrowserSwarm is a tool that automates test- ing of JavaScript across browsers and mobile devices. It connects di- rectly to a development team’s code repository on GitHub.When the code gets updated, BrowserSwarm automatically executes a suite of tests using common unit testing frameworks against a wide array of browser and OS combinations. BrowserSwarm is powered on the backend by Sauce Labs and allows developers and QA engineers to automatically test web and mobile apps across 150+ browser / OS combinations, including iOS, Android, and Mac OS X. http://www.drdobbs.com/240162298 By Adrian Bridgwater
  • 7. INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> www.drdobbs.com [OPEN-SOURCE DASHBOARD] November 2013 7 TOP OPEN-SOURCE PROJECTS Trending this month on GitHub: jlukic/Semantic-UI JavaScript https://github.com/jlukic/Semantic-UI Creating a shared vocabulary for UI. HubSpot/pace CSS https://github.com/HubSpot/pace Automatic Web page progress bar. maroslaw/rainyday.js JavaScript https://github.com/maroslaw/rainyday.js Simulating raindrops falling on a window. peachananr/onepage-scroll JavaScript https://github.com/peachananr/onepage-scroll Create an Apple-like one page scroller website (iPhone 5S website) with One Page Scroll plugin. twbs/bootstrap JavaScript https://github.com/twbs/bootstrap Sleek,intuitive,and powerful front-end framework for faster and easier Web development. mozilla/togetherjs JavaScript https://github.com/mozilla/togetherjs A service for your website that makes it surprisingly easy to collaborate in real-time. daviferreira/medium-editor JavaScript https://github.com/daviferreira/medium-editor Medium.com WYSIWYG editor clone. alvarotrigo/fullPage.js JavaScript https://github.com/alvarotrigo/fullPage.js fullPage plugin by Alvaro Trigo.Create full-screen pages fast and simple. angular/angular.js JavaScript https://github.com/angular/angular.js Extend HTML vocabulary for your applications. Trending this month on SourceForge: Notepad++ Plugin Manager http://sourceforge.net/projects/npppluginmgr/ The plugin list for Notepad++ Plugin Manager with code for the plugin manager. MinGW:Minimalist GNU for Windows: http://sourceforge.net/projects/mingw/ A native Windows port of the GNU Compiler Collection (GCC). Apache OpenOffice http://sourceforge.net/projects/openofficeorg.mirror/ An open-source office productivity software suite containing word processor, spreadsheet,presentation,graphics,formula editor,and database management applications. YTD Android http://sourceforge.net/projects/rahul/ Files Downloader is a free powerful utility that will help you to download your favorite videos from youtube.The application is platform-independent. PortableApps.com http://sourceforge.net/projects/portableapps/ Popular portable software solution. Media Player Classic:Home Cinema http://sourceforge.net/projects/mpc-hc/ This project is based on the original Guliverkli project,and contains additional features and bug fixes (see complete list on the project’s website). Anti-Spam SMTP Proxy Server http://sourceforge.net/projects/assp/ The Anti-Spam SMTP Proxy (ASSP) Server project aims to create an open- source platform-independent SMTP Proxy server. Ubuntuzilla:Mozilla Software Installer http://sourceforge.net/projects/ubuntuzilla/ An APT repository hosting the Mozilla builds of the latest official releases of Firefox,Thunderbird,and Seamonkey.
  • 8. November 2013 8www.drdobbs.com Understanding What Big Data Can Deliver It’s easy to err by pushing data to fit a projected model. Insights come, however, from accepting the data’s ability to depict what is going on, without imposing an a priori bias. ith all the hype and anti-hype surrounding Big Data,the data management practitioner is, in an ironic turn of events, in- undated with information about Big Data. It is easy to get lost trying to figure out whether you have Big Data problems and, if so, how to solve them. It turns out the secret to taming your Big Data problems is in the detail data.This article explains how focusing on the details is the most important part of a successful Big Data project. Big Data is not a new idea.Gartner coined the term a decade ago,de- scribing Big Data as data that exhibits three attributes:Volume,Velocity, and Variety. Industry pundits have been trying to figure out what that means ever since. Some have even added more “Vs” to try and better explain why Big Data is something new and different than all the other data that came before it. The cadence of commentary on Big Data has quickened to the extent that if you set up a Google News alert for “Big Data,” you will spend more of your day reading about Big Data than implementing a Big Data solution.What the analysts gloss over and the vendors attempt to sim- plify is that Big Data is primarily a function of digging into the details of the data you already have. Gartner might have coined the term “Big Data,” but they did not invent the concept. Big Data was just rarer then than it is today. Many companies have been managing Big Data for ten years or more. These companies may have not had the efficiencies of scale that we benefit from currently, yet they were certainly paying atten- tion to the details of their data and storing as much of it as they could afford. A Brief History of Data Management Data management has always been a balancing act between the vol- ume of data and our capacity to store,process,and understand it. The biggest achievement of the On Line Analytic Processing (OLAP) era was to give users interactive access to data,which was summarized across multiple dimensions. OLAP systems spent a significant amount of time up front to pre-calculate a wide variety of aggregations over a data set that could not otherwise be queried interactively.The output was called a “cube” and was typically stored in memory, giving end users the ability to ask any question that had a pre-computed answer and get results in less than a second. By Aaron Kimball [WHAT BIG DATA CAN DELIVER] W INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >>
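To make the pre-computed "cube" idea concrete, here is a minimal Python sketch with invented sales figures (the region and month dimensions are hypothetical, not taken from the article): the aggregates are computed once up front, so answering a query is just a dictionary lookup, which is why results come back in under a second as long as the question was anticipated.

from collections import defaultdict

# Hypothetical raw fact rows: (region, month, revenue). A real OLAP system
# would hold millions of rows and many more dimensions.
facts = [
    ("north", "2013-01", 120.0),
    ("north", "2013-02", 95.5),
    ("south", "2013-01", 80.0),
    ("south", "2013-02", 143.25),
]

# Build phase: pre-calculate every aggregate we expect to be asked about.
cube = defaultdict(float)
for region, month, revenue in facts:
    cube[(region, month)] += revenue   # finest grain
    cube[(region, "*")] += revenue     # rolled up over months
    cube[("*", month)] += revenue      # rolled up over regions
    cube[("*", "*")] += revenue        # grand total

# Query phase: any question with a pre-computed answer is a single lookup.
print(cube[("north", "*")])    # 215.5
print(cube[("*", "2013-01")])  # 200.0

The limitation is the one just described: only questions that were anticipated and pre-computed can be answered this way.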
  • 9. www.drdobbs.com Big Data is exploding as we enter the era of plenty — high band- width,greater storage capacity,and many processor cores.New soft- ware, written after these systems became available, is different than its forebears. Instead of highly tuned, high-priced systems that op- timize for the minimum amount of data required to answer a ques- tion, the new software captures as much data as possible in order to answer as-yet-undefined queries. With this new data captured and stored, there are a lot of details that were previously unseen. Why More Data Beats Better Algorithms Before I get into how detail data is used, it is crucial to understand at the algorithmic level the signal importance of detail data. Since the former Director ofTechnology at Amazon.com,Anand Rajaraman,first expounded the concept that “more data beats better algorithms,” his claim has been supported and attacked many times.The truth behind his assertion is rather subtle. To really understand it, we need to be more specific about what Rajaraman said,then explain in a simple ex- ample how it works. Experienced statisticians understand that having more training data can improve the accuracy of and confidence in a model.For example, say we believe that the relationship between two variables — such as number of pages viewed on a website and percent likelihood to make a purchase — is linear. Having more data points would improve our estimate of the underlying linear relationship.Compare the graphs in Figures 1 and 2, showing that more data will give us a more accurate and confident estimation of the linear relationship. A statistician would also be quick to point out that we cannot in- crease the effectiveness of this pre-selected model by adding even more data. Adding another 100 data points to Figure 2, for example, would not greatly improve the accuracy of the model. The marginal [WHAT BIG DATA CAN DELIVER] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 9 Figure 1:Using little data to estimate a relationship. Figure 2:The same relationship with more data.
  • 10. www.drdobbs.com [WHAT BIG DATA CAN DELIVER] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 10 benefit of adding more training data in this case decreases quickly.Given this example, we could argue that having more data does not always beat more-sophisticated algorithms at predicting the expected out- come. To increase accuracy as we add data, we would need to change our model. The “trick” to effectively using more data is to make fewer initial as- sumptions about the underlying model and let the data guide which model is most appropriate. In Figure 1, we assumed the linear model after collecting very little data about the relationship between page views and propensity to purchase.As we will see, if we deploy our lin- ear model, which was built on a small sample of data, onto a large data set, we will not get very accurate estimates. If instead we are not constrained by data collection, we could collect and plot all of the data before committing to any simplifying assumptions. In Figure 3, we see that additional data reveals a more complex clustering of data points. By making a few weak (that is,tentative) assumptions,we can evaluate alternative models. For example, we can use a density estimation tech- nique instead of using the linear parametric model, or use other tech- niques. With an order of magnitude more data, we might see that the true relationship is not linear.For example,representing our model as a histogram as in Figure 4 would produce a much better picture of the underlying relationship. Linear regression does not predict the relationship between the vari- ables accurately because we have already made too strong an assump- tion that does not allow for additional unique features in the data to be Figure 3:Even more data shows a different relationship. Figure 4:The data in Figure 3 represented as a histogram.
  • 11. www.drdobbs.com captured — such as the U-shaped dip between 20 and 30 on the x- axis.With this much data, using a histogram results in a very accurate model. Detail data allows us to pick a nonparametric model — such as estimating a distribution with a histogram — and gives us more confidence that we are building an accurate model. If this were a much larger parameter space, the model itself, repre- sented by just the histogram,could be very large.Using nonparametric models is common in Big Data analysis because detail data allows us to let the data guide our model selection, especially when the model is too large to fit in memory on a single machine. Some examples in- clude item similarity matrices for millions of products and association rules derived using collaborative filtering techniques. One Model to Rule Them All The example in Figures 1 through 4 demonstrates a two-dimensional model mapping the number of pages a customer views on a website to the percent likelihood that the customer will make a purchase. It may be the case that one type of customer,say a homemaker looking for the right style of throw pillow, is more likely to make a purchase the more pages they view. Another type of customer — for example, an amateur contractor — may only view a lot of pages when doing re- search. Contractors might be more likely to make a purchase when they go directly to the product they know they want.Introducing ad- ditional dimensions can dramatically complicate the model;and main- taining a single model can create an overly generalized estimation. Customer segmentation can be used to increase the accuracy of a model while keeping complexity under control. By using additional data to first identify which model to apply, it is possible to introduce additional dimensions and derive more-accurate estimations. In this example, by looking at the first product that a customer searches for, we can select a different model to apply based on our prediction of which segment of the population the customer falls into.We use a dif- ferent model for segmentation based on data that is related yet dis- tinct from the data we use for the model that predicts how likely the customer is to make a purchase. First, we consider a specific product that they look at and then we consider the number of pages they visit. Demographics and Segmentation No Longer Are Sufficient Applications that focus on identifying categories of users are built with user segmentation systems.Historically,user segmentation was based on demographic information. For example, a customer might have been identified as a male between the ages of 25-34 with an annual household income of $100,000-$150,000 and living in a particular county or zip code.As a means of powering advertising channels such as television, radio, newspapers, or direct mailings, this level of detail was sufficient. Each media outlet would survey its listeners or readers to identify the demographics for a particular piece of syndicated con- tent and advertisers could pick a spot based on the audience segment. 
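Before moving on, the argument behind Figures 1 through 4 can be made concrete with a short sketch. The Python below generates synthetic page-view/purchase data whose true shape dips between 20 and 30 page views (all numbers are invented; only the general shape mirrors the figures), then compares a pre-selected linear fit with a simple binned-mean "histogram" estimate. With enough detail data, the binned estimate recovers the dip that the linear model smooths away.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic detail data: page views (x) versus purchase likelihood in percent (y).
# The true relationship is non-linear, with a dip between 20 and 30 page views.
# All numbers are invented for illustration.
x = rng.uniform(0, 50, 5000)
true_y = 40 + 0.8 * x - 25 * np.exp(-((x - 25) ** 2) / 20.0)
y = true_y + rng.normal(0, 5, x.size)

# Parametric model chosen up front: a straight line.
slope, intercept = np.polyfit(x, y, 1)

# Non-parametric alternative: binned means, i.e. let the data describe the
# shape instead of assuming one in advance.
bins = np.arange(0, 55, 5)
bin_index = np.digitize(x, bins)
binned_mean = [y[bin_index == b].mean() for b in range(1, len(bins))]

for left, est in zip(bins[:-1], binned_mean):
    center = left + 2.5
    print("%2d-%2d views: binned mean %5.1f, linear fit %5.1f"
          % (left, left + 5, est, slope * center + intercept))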
With the evolution of online advertising and Internet-based media, segmentation started to become more refined.Instead of a dozen de- mographic attributes,publishers were able to get much more specific [WHAT BIG DATA CAN DELIVER] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 11 “Using nonparametric models is common in Big Data analysis because detail data allows us to let the data guide our model selection”
  • 12. www.drdobbs.com about a customer’s profile. For example, based on Internet browsing habits,retailers could tell whether a customer lived alone,were in a re- lationship,traveled regularly,and so on.All this information was avail- able previously but it was difficult to collate. By instrumenting cus- tomer website browsing behavior and correlating this data with purchases, retailers could fine tune their segmenting algorithms and create ads targeted to specific types of customers. Today, nearly every Web page a user views is connected directly to an advertising network.These ad networks connect to ad exchanges to find bidders for the screen real estate of the user’s Web browser.Ad exchanges operate like stock exchanges except that each bid slot is for a one-time ad to a specific user.The exchange uses the user’s profile information or their browser cookies to convey the customer segment of the user. Advertisers work with specialized digital marketing firms whose algorithms try to match the potential viewer of an advertise- ment with the available ad inventory and bid appropriately. Real-Time Updating of Data Matters (People Aren’t Static) Segmentation data used to change rarely with one segmentation map reflecting the profile of a particular audience for months at a time; to- day, segmentation can be updated throughout the day as customers’ profiles change.Using the same information gleaned from user behav- ior that assigns a customer’s initial segment group, organizations can update a customer’s segment on a click-by-click basis.Each action bet- ter informs the segmentation model and is used to identify what in- formation to present next. The process of constantly re-evaluating customer segmentation has enabled new dynamic applications that were previously impos- sible in the offline world.For example,when a model results in an in- correct segmentation assignment, new data based on customer ac- tions can be used to update the model.If presenting the homemaker with a power tool prompts the homemaker to go back to the search bar,the segmentation results are probably mistaken.As details about a customer emerge, the model’s results become more accurate. A customer that the model initially predicted was an amateur contrac- tor looking at large quantities of lumber may in fact be a professional contractor. By constantly collecting new data and re-evaluating the models,on- line applications can tailor the experience to precisely what a customer is looking for. Over longer periods of time, models can take into ac- count new data and adjust based on larger trends. For example, a stereotypical life trajectory involves entering into a long-term relation- ship, getting engaged, getting married, having children, and moving to the suburbs.At each stage in life and in particular during the transi- tions,one’s segment group changes.By collecting detailed data about online behaviors and constantly reassessing the segmentation model, these life transitions are automatically incorporated into the user’s ap- plication experience. [WHAT BIG DATA CAN DELIVER] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 12 “Big Data has seen a lot of hype in recent years,yet it remains unclear to most practitioners where they need to focus their time and attention.Big Data is,in large part, about paying attention to the details in a data set”
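A toy Python sketch of this idea, using the homemaker/contractor example from the text: the segment is re-derived from the full event history on every click. The per-segment purchase models, the segment rules, and the event strings are all invented for illustration.

SEGMENT_MODELS = {
    # Percent likelihood to purchase as a function of pages viewed; both
    # curves are made up purely to show the dispatch mechanism.
    "homemaker": lambda pages: min(90.0, 10.0 + 4.0 * pages),
    "contractor": lambda pages: max(5.0, 60.0 - 2.0 * pages),
}

def assign_segment(events):
    """Re-derive the segment from everything observed so far."""
    segment = "homemaker" if events and "pillow" in events[0] else "contractor"
    # New evidence can overturn the initial guess: bouncing back to the search
    # bar after a recommendation suggests the customer was mis-segmented.
    if "back_to_search" in events:
        segment = "contractor" if segment == "homemaker" else "homemaker"
    return segment

events = []
for action in ["search:throw pillow", "view:power drill",
               "back_to_search", "view:lumber"]:
    events.append(action)
    segment = assign_segment(events)    # re-evaluated on every click
    pages = len(events)                 # treat each action as one page view
    score = SEGMENT_MODELS[segment](pages)
    print("after %-20s segment=%-10s likelihood=%2.0f%%" % (action, segment, score))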
  • 13. www.drdobbs.com Instrument Everything We’ve shown examples of how detail data can be used to pick better models, which result in more accurate predictions. And I have ex- plained how models built on detail data can be used to create better application experiences and adapt more quickly to changes in cus- tomer behavior.If you’ve become a believer in the power of detail data and you’re not already drowning in it,you likely want to know how to get some. It is often said that the only way to get better at something is to measure it.This is true of customer engagement as well. By recording the details of an application,organizations can effectively recreate the flow of interaction.This includes not just the record of purchases, but a record of each page view, every search query, or selected category, and the details of all items that a customer viewed. Imagine a store clerk, taking notes as a customer browses and shops or asks for assis- tance.All of these actions can be captured automatically when the in- teraction is digital. Instrumentation can be accomplished in two ways. Most modern Web and application servers record logs of their activity to assist with operations and troubleshooting.By processing these logs,it is possible to extract the relevant information about user interactions with an ap- plication. A more direct method of instrumentation is to explicitly record actions taken by an application into a database.When the ap- plication,running in an application server,receives a request to display all the throw pillows in the catalog, it records this request and associ- ates it with the current user. Test Constantly The result of collecting detail data, building more accurate models, and refining customer segments is a lot of variability in what gets shown to a particular customer. As with any model-based system, past performance is not necessarily indicative of future results. The relationships between variables change,customer behavior changes, and of course reference data such as product catalogs change.In or- der to know whether a model is producing results that help drive customers to success,organizations must test and compare multiple models. A/B testing is used to compare the performance of a fixed number of experiments over a set amount of time.For example,when deciding which of several versions of an image of a pillow a customer is most likely to click on,you can select a subset of customers to show one im- age or another.What A/B testing does not capture is the reason behind a result.It may be by chance that a high percentage of customers who saw version A of the pillow were not looking for pillows at all and would not have clicked on version B either. An alternative to A/B testing is a class of techniques called Bandit al- gorithms. 
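The sketch below contrasts the two approaches described here and in the paragraph that follows: a fixed A/B split sends equal traffic to each variant for the whole test, while a simple epsilon-greedy bandit steadily shifts traffic toward whichever variant is converting better. The variant names and click-through rates are invented.

import random

random.seed(42)

# Invented "true" click-through rates for two versions of the pillow image.
TRUE_RATE = {"A": 0.04, "B": 0.06}

def show(version):
    return random.random() < TRUE_RATE[version]   # did the customer click?

# Fixed A/B split: each variant gets the same share of traffic for the whole
# test, regardless of how it is performing.
ab_clicks = {"A": 0, "B": 0}
for i in range(10000):
    version = "A" if i % 2 == 0 else "B"
    ab_clicks[version] += show(version)

# Epsilon-greedy bandit: mostly show whichever variant is doing best so far,
# but keep exploring 10% of the time.
shown = {"A": 1, "B": 1}       # start at 1 to avoid division by zero
clicks = {"A": 0, "B": 0}
for _ in range(10000):
    if random.random() < 0.1:
        version = random.choice(["A", "B"])
    else:
        version = max(shown, key=lambda v: clicks[v] / shown[v])
    shown[version] += 1
    clicks[version] += show(version)

print("A/B clicks:", ab_clicks)
print("bandit impressions:", shown, "clicks:", clicks)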
Bandit algorithms use the results of multiple models and [WHAT BIG DATA CAN DELIVER] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 13 AutomaticDataCollection Somedataisalreadycollectedautomatically.EveryWebserverrecordsdetailsabout theinformationrequestedbythecustomer’sWebbrowser.Whilenotwellorganized orobviouslyusable,thisinformationoftenincludessufficientdetailtoreconstructa customer’s session.The log records include timestamps,session identifiers,client IP address and the request URL including the query string.If this data is combined with a session table, a geo-IP database and a product catalog, it is possible to fairly accuratelyreconstructthecustomer’sbrowsingexperience.
constantly evaluate which experiment to run. Experiments that perform better (for any reason) are shown more often. The result is that experiments can be run constantly and measured against the data collected for each experiment. The combinations do not need to be predetermined and the more successful experiments automatically get more exposure.

Conclusion

Big Data has seen a lot of hype in recent years, yet it remains unclear to most practitioners where they need to focus their time and attention. Big Data is, in large part, about paying attention to the details in a data set. The techniques available historically have been limited to the level of detail that the hardware available at the time could process. Recent developments in hardware capabilities have led to new software that makes it cost effective to store all of an organization's detail data. As a result, organizations have developed new techniques around model selection, segmentation, and experimentation. To get started with Big Data, instrument your organization's applications, start paying attention to the details, let the data inform the models — and test everything.

—Aaron Kimball founded WibiData in 2010 and is the Chief Architect for the Kiji project. He has worked with Hadoop since 2007 and is a committer on the Apache Hadoop project. In addition, Aaron founded Apache Sqoop, which connects Hadoop to relational databases, and Apache MRUnit for testing Hadoop projects.
Applying the Big Data Lambda Architecture

A look inside a Hadoop-based project that matches connections in social media by leveraging the highly scalable lambda architecture.

By Michael Hausenblas

Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He's the driving force behind Storm (see page 24) and at Twitter he led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications. Marz and his team described the underlying motivation for building systems with the lambda architecture as:

• The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
• To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
• The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
• The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.

From a bird's eye view, the lambda architecture has three major components that interact with new data coming in and respond to queries, which in this article are driven from the command line:

Figure 1: Overview of the lambda architecture.
  • 16. www.drdobbs.com Essentially, the Lambda Architecture comprises the following com- ponents,processes,and responsibilities: • New Data:All data entering the system is dispatched to both the batch layer and the speed layer for processing. • Batch layer:This layer has two functions:(i) managing the master dataset, an immutable, append-only set of raw data, and (ii) to pre-compute arbitrary query functions, called batch views. Hadoop’s HDFS (http://is.gd/Emgj57) is typically used to store the master dataset and perform the computation of the batch views using MapReduce (http://is.gd/StjZaI). • Serving layer: This layer indexes the batch views so that they can be queried in ad hoc with low latency. To implement the serving layer, usually technologies such as Apache HBase (http://is.gd/2ro9CY) or ElephantDB (http://is.gd/KgIZ2G) are utilized.The Apache Drill project (http://is.gd/wB1IYy) provides the capability to execute full ANSI SQL 2003 queries against batch views. • Speed layer:This layer compensates for the high latency of updates to the serving layer, due to the batch layer. Using fast and incre- mental algorithms, the speed layer deals with recent data only. Storm (http://is.gd/qP7fkZ) is often used to implement this layer. • Queries:Last but not least,any incoming query can be answered by merging results from batch views and real-time views. Scope and Architecture of the Project In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social net- works, such as Twitter or LinkedIn, or from real-life circumstances.The aim is to scale out to several billions of users while providing low-la- tency access to the stored information.To keep the system simple and comprehensible,I limit myself to bulk import of the data (no capabili- ties to live-stream data from social networks) and provide only a very simple a command-line user interface. The guts, however, use the lambda architecture. It’s easiest to think about USN in terms of two orthogonal phases: • Build-time, which includes the data pre-processing, generating the master dataset as well as creating the batch views. • Runtime,in which the data is actually used,primarily via issuing queries against the data space. The USN app architecture is shown below in Figure 2: [LAMBDA] November 2013 16 INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> Figure 2:High-level architecture diagram of the USN app.
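As described in the Queries item above, the defining move of the lambda architecture is that an answer is assembled by merging a complete-but-stale batch view with a small-but-fresh real-time view. Here is a minimal Python sketch of that merge, with both views reduced to plain dicts standing in for HBase/ElephantDB and the speed layer's state; the sample names come from the project's test data described later, and the network codes T and L stand for Twitter and LinkedIn.

batch_view = {          # computed periodically over the whole master dataset
    "Michael": {"T": ["Ora Hatfield"], "L": ["Ted"]},
}
realtime_view = {       # covers only data that arrived since the last batch run
    "Michael": {"T": ["Marvin Garrison"]},
}

def query_friends(user):
    merged = {}
    for view in (batch_view, realtime_view):
        for network, friends in view.get(user, {}).items():
            merged.setdefault(network, []).extend(friends)
    return merged

print(query_friends("Michael"))
# {'T': ['Ora Hatfield', 'Marvin Garrison'], 'L': ['Ted']}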
  • 17. www.drdobbs.com The following subsytems and processes, in line with the lambda ar- chitecture,are at work in USN: • Data pre-processing. Strictly speaking this can be considered part of the batch layer. It can also be seen as an independent process necessary to bring the data into a shape that is suitable for the master dataset generation. • The batch layer. Here, a bash shell script (http://is.gd/smhcl6) is used to drive a number of HiveQL (http://is.gd/8qSOSF) queries (see the GitHub repo, in the batch-layer folder at http://is.gd/QDU6pH) that are responsible to load the pre- processed input CSV data into HDFS. • The serving layer. In this layer, we use a Python script (http://is.gd/Qzklmw) that loads the data from HDFS via Hive and inserts it into a HBase table, and hence creating a batch view of the data.This layer also provides query capabilities,necessary in the runtime phase to serve the front-end. • Command-line front end.The USN app front-end is a bash shell script (http://is.gd/nFZoqB) interacting with the end-user and providing operations such as listings,lookups,and search. This is all there is from an architectural point of view.You may have noticed that there is no speed layer in USN, as of now. This is due to the scope I initially introduced above.At the end of this article,I’ll revisit this topic. The USN App Technology Stack and Data Recently, Dr. Dobb’s discussed Pydoop: Writing Hadoop Programs in Python (http://www.drdobbs.com/240156473), which will serve as a gentle introduction into setting up and using Hadoop with Python.I’m going to use a mixture of Python and bash shell scripts to implement the USN. However, I won’t rely on the low-level MapReduce API pro- vided by Pydoop,but rather on higher-level libraries that interface with Hive and HBase,which are part of Hadoop.Note that the entire source code,including the test data and all queries as well as the front-end,is available in a GitHub repository (http://is.gd/XFI4wY), and it is neces- sary to follow along with this implementation. Before I go into the technical details such as the concrete technology stack used,let’s have a quick look at the data transformation happen- ing between the batch and the serving layer (Figure 3). [LAMBDA] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 17 Figure 3:Data transformation from batch to serving layer in the USN app.
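The transformation in Figure 3, folding the master dataset's ADD/REMOVE actions into a per-user, per-network friends view, amounts to the following simplified Python sketch. Field names follow the raw CSV format described in the next section; the USN app itself performs this step with the HiveQL driven by batch-layer.sh rather than with Python.

import csv
from collections import defaultdict

def build_friends_view(path):
    """Fold ADD/REMOVE actions into user -> network -> set of friends."""
    view = defaultdict(lambda: defaultdict(set))
    with open(path, newline="") as f:
        # Columns: timestamp, originator, action, network, target, context.
        # Assumes rows are already in timestamp order; sort them first otherwise.
        for timestamp, originator, action, network, target, context in csv.reader(f):
            if action == "ADD":
                view[originator][network].add(target)
            elif action == "REMOVE":
                view[originator][network].discard(target)
    return view

# Example, run against the raw test data shipped with the project:
# view = build_friends_view("usn-raw-data.csv")
# print(sorted(view["Michael"]["I"]))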
  • 18. www.drdobbs.com As hinted in Figure 3, the master dataset (left) is a collection of atomic actions: either a user has added someone to their networks or the reverse has taken place, a person has been removed from a network. This form of the data is as raw as it gets in the context of our USN app and can serve as the basis for a variety of views that are able to answer different sorts of queries. For simplicity’s sake, I only consider one possible view that is used in the USN app front- end: the “network-friends” view, per user, shown in the right part of Figure 3. Raw Input Data The raw input data is a Comma Separated Value (CSV) file with the fol- lowing format: timestamp,originator,action,network,target,context 2012-03-12T22:54:13-07:00,Michael,ADD,I,Ora Hatfield, bla 2012-11-23T01:53:42-08:00,Ted,REMOVE,I,Marvin Garrison, meh ... The raw CSV file contains the following six columns: • timestamp is an ISO 8601 formatted date-time stamp that states when the action was performed (range:January 2012 to May 2013). • originator is the name of the person who added or removed a person to or from one of his or her networks. • action must be either ADD or REMOVE and designates the action that has been carried out.That is, it indicates whether a person has been added or removed from the respective network. • network is a single character indicating the respective network where the action has been performed. The possible values are: I,in-real-life;T,Twitter;L,LinkedIn; F,Facebook;G,Google+ • target is the name of the person added to or removed from the network. • context is a free-text comment,providing a hint why the person has been added/removed or where one has met the person in the first place. There are no optional fields in the dataset. In other words: Each row is completely filled. In order to generate some test data to be used in the USN app, I’ve created a raw input CSV file from generatedata.com in five runs,yielding some 500 rows of raw data. Technology Stack USN uses several software frameworks,libraries,and components,as I mentioned earlier.I’ve tested it with: • Apache Hadoop 1.0.4 (http://is.gd/4suWof) • Apache Hive 0.10.0 (http://is.gd/tOfbsP) • Hiver for Hive access from Python (http://is.gd/OXujzB) • Apache HBase 0.94.4 (http://is.gd/7VnBqR) • HappyBase for HBase access from Python (http://is.gd/BuJzaH) I assume that you’re familiar with the bash shell and have Python 2.7 or above installed. I’ve tested the USN app under Mac OS X 10.8 but there are no hard dependencies on any Mac OS X specific features,so it should run unchanged under any Linux environment. Building the USN Data Space The first step is to build the data space for the USN app, that is, the master dataset and the batch view,and then we will have a closer look behind the scenes of each of the commands. [LAMBDA] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 18
  • 19. www.drdobbs.com First,some pre-processing of the raw data,generated earlier: $ pwd /Users/mhausenblas2/Documents/repos/usn-app/data $ ./usn-preprocess.sh < usn-raw-data.csv > usn-base-data.csv Next, we want to build the batch layer. For this, I first need to make sure that the Hive Thrift service is running: $ pwd /Users/mhausenblas2/Documents/repos/usn-app/batch-layer $ hive --service hiveserver Starting Hive Thrift Server ... Now,I can run the script that execute the Hive queries and builds our USN app master dataset,like so: $ pwd /Users/mhausenblas2/Documents/repos/usn-app/batch-layer $ ./batch-layer.sh INIT USN batch layer created. $ ./batch-layer.sh CHECK The USN batch layer seems OK. This generates the batch layer, which is in HDFS. Next, I create the serving layer in HBase by building a view of the relationships to people. For this, both the Hive and HBase Thrift services need to be running. Below,you see how you start the HBase Thrift service: $ echo $HBASE_HOME /Users/mhausenblas2/bin/hbase-0.94.4 $ cd /Users/mhausenblas2/bin/hbase-0.94.4 $ ./bin/start-hbase.sh starting master, logging to /Users/... $ ./bin/hbase thrift start -p 9191 13/05/31 09:39:09 INFO util.VersionInfo: HBase 0.94.4 As now both Hive and HBaseThrift services are up and running,I can run the following command (in the respective directory, wherever you’ve unzipped or cloned the GitHub repository): $ echo $HBASE_HOME /Users/mhausenblas2/bin/hbase-0.94.4 $ cd /Users/mhausenblas2/bin/hbase-0.94.4 $ ./bin/start-hbase.sh starting master, logging to /Users/... $ ./bin/hbase thrift start -p 9191 13/05/31 09:39:09 INFO util.VersionInfo: HBase 0.94.4 Now,let’s have a closer look at what is happening behind the scenes of each of the layers in the next sections. The Batch Layer The raw data is first pre-processed and loaded into Hive. In Hive (re- member, this constitutes the master dataset in the batch layer of our USN app) the following schema is used: CREATE TABLE usn_base ( actiontime STRING, originator STRING, action STRING, network STRING, target STRING, context STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘|’; To import the CSV data, to build the master dataset, the shell script batch-layer.sh executes the following HiveQL commands: LOAD DATA LOCAL INPATH ‘../data/usn-base-data.csv’ INTO TABLE usn_base; [LAMBDA] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 19
  • 20. www.drdobbs.com DROP TABLE IF EXISTS usn_friends; CREATE TABLE usn_friends AS SELECT actiontime, originator AS username, network, target AS friend, context AS note FROM usn_base WHERE action = ‘ADD’ ORDER BY username, network, username; With this,the USN app master dataset is ready and available in HDFS and I can move on to the next layer,the serving layer. The Serving Layer of the USN App The batch view used in the USN app is realized via an HBase table called usn_friends.This table is then used to drive the USN app front- end;it has the schema shown in Figure 4. After building the serving layer, I can use the HBase shell to verify if the batch view has been properly populated in the respective table usn_friends: $ ./bin/hbase shell hbase(main):001:0> describe ‘usn_friends’ ... {NAME => ‘usn_friends’, FAMILIES => [{NAME => ‘a’, DATA_BLOCK_ENCODING => ‘NONE’, BLOOMFILTER => ‘N true ONE’, REPLICATION_SCOPE => ‘0’, VERSIONS => ‘3’, COMPRESSION => ‘NONE’, MIN_VERSIONS => ‘0’, TTL => ‘-1’, KEEP_DELETED_CELLS => ‘false’, BLOCKSIZE => ‘65536’, IN_MEMORY => ‘false’, ENCODE_ON_DISK => ‘true’, BLOCKCACHE => ‘false’}]} 1 row(s) in 0.2450 seconds You can have a look at some more queries used in the demo user in- terface on theWiki page of the GitHub repository (http://is.gd/7v0IXz). Putting It All Together After the batch and serving layers have been initialized and launched, as described, you can launch the user interface. To use the CLI, make sure that HBase and the HBase Thrift service are running and then, in the main USN app directory run: $ ./usn-ui.sh This is USN v0.0 u ... user listings, n ... network listings, l ... lookup, s ... search, h ... help, q ... quit [LAMBDA] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 20 Figure 4:HBase schema used in the serving layer of the USN app.
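Before looking at the front-end, here is roughly what the serving-layer load looks like through HappyBase, the Python HBase client listed in the technology stack. The column family a matches the describe output above, but the row-key layout (username|network|friend) and the column qualifiers are assumptions made for this sketch; the authoritative schema is the one shown in Figure 4.

import happybase

# Connect through the HBase Thrift service started earlier on port 9191.
connection = happybase.Connection("localhost", port=9191)
table = connection.table("usn_friends")

# Write one row of the batch view (row key and qualifiers are illustrative).
table.put(b"Michael|T|Ora Hatfield", {
    b"a:friend": b"Ora Hatfield",
    b"a:note": b"bla",
})

# Serve a front-end style query: all of Michael's Twitter acquaintances.
for key, data in table.scan(row_prefix=b"Michael|T|"):
    print(key, data[b"a:friend"])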
  • 21. www.drdobbs.com Figure 5 shows a screen shot of the USN app front-end in action.The three main operations the USN front-end provides are as follows: • u ... user listing lists all acquaintances of a user • n ... network listing lists acquaintances of a user in a net- work • l ... lookup listing lists acquaintances of a user in a net- work and allows restrictions on the time range (from/to) of the acquaintanceship • s ... search provides search for an acquaintance over all users,allowing for partial match An example USN app front-end session is available at the GitHub repo (http://is.gd/c3i6FW) for you to study. What’s Next? I have intentionally kept USN simple. Although fully functional, it has several intentional limitations (due to space restrictions here). I can suggest several improvements you could have a go at,using the avail- able code base (http://is.gd/XFI4wY) as a starting point. • Bigger data:The most obvious point is not the app itself but the data size.Only laughable 500 rows? This isn’t Big Data I hear you say. Rightly so. Now, no one stops you generating 500 million rows or more and try it out. Certain processes such as pre-pro- cessing and the generating the layers will take longer but there are no architectural changes necessary, and this is the whole point of this USN app. • Creating a full-blown batch layer: Currently, the batch layer is a sort of one-shot,while it should really run in a loop and append new data. This requires partitioning of the ingested data and some checks. Pail (http://is.gd/sJAKGN), for example, allows you to do the ingestion and partitioning in a very elegant way. • Adding speed layer and automated import: It would be inter- esting to automate the import of data from the various social networks. For example, Google Takeout (http://is.gd/Zy0HcB) allows exporting all data in bulk mode,including G+ Circles.For a stab at the speed layer, one could try and utilize the Twitter fire-hose (http://is.gd/xVroGO) along with Storm. • More batch views:There is currently only one view (friend list per network, per user) in the serving layer.The USN app might ben- efit from different views to enable different queries most effi- ciently,such as time-series views of network growth or overlaps of acquaintanceships across networks. [LAMBDA] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 21 Figure 5:Screen-shot of the USN app command line user interface.
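The "bigger data" suggestion above is easy to try. The sketch below writes an arbitrarily large raw CSV in the same six-column, comma-separated format; the output filename, the extra names, and the note text are invented, and the timestamp range roughly matches the January 2012 to May 2013 range of the test data.

import random
from datetime import datetime, timedelta

NETWORKS = "ITLFG"                      # the five single-character network codes
PEOPLE = ["Michael", "Ted", "Ora Hatfield", "Marvin Garrison", "Ann", "Bob"]
START = datetime(2012, 1, 1)

def random_row():
    ts = START + timedelta(seconds=random.randrange(500 * 24 * 3600))
    originator, target = random.sample(PEOPLE, 2)
    action = random.choice(["ADD", "ADD", "ADD", "REMOVE"])   # mostly additions
    return ",".join([ts.strftime("%Y-%m-%dT%H:%M:%S-07:00"), originator,
                     action, random.choice(NETWORKS), target, "generated"])

with open("usn-raw-data-big.csv", "w") as f:
    for _ in range(500000):             # scale this number up as far as you like
        f.write(random_row() + "\n")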
I hope you have as much fun playing around with the USN app and extending it as I had writing it in the first place. I'd love to hear back from you on ideas or further improvements, either directly here as a comment or via the GitHub issue tracker of the USN app repository.

Further Resources

• A must-read for the Lambda Architecture is the Big Data book by Nathan Marz and James Warren from Manning (http://is.gd/lPtVJS). The USN app idea actually stems from one of the examples used in this book.
• Slide deck on a real-time architecture using Hadoop and Storm (http://is.gd/nz0wD6) from FOSDEM 2013.
• A blog post about an example "lambda architecture" for real-time analysis of hashtags using Trident, Hadoop, and Splout SQL (http://is.gd/ZTJarF).
• Additional batch layer technologies such as Pail (http://is.gd/sJAKGN) for managing the master dataset and JCascalog (http://is.gd/i7jf1W) for creating the batch views.
• Apache Drill (http://is.gd/wB1IYy) for providing interactive, ad-hoc queries against HDFS, HBase, or other NoSQL back-ends.
• Additional speed layer technologies, such as Trident (http://is.gd/Bxqt9j), a high-level abstraction for doing real-time computing on top of Storm, and MapR's Direct Access NFS (http://is.gd/BaoE0l) to land data directly from streaming sources such as social media streams or sensor devices.

—Michael Hausenblas is the Chief Data Engineer EMEA, MapR Technologies.
  • 23. November 2013 23www.drdobbs.com Easy,Real-Time Big Data Analysis Using Storm Conceptually straightforward and easy to work with, Storm makes handling big data analysis a breeze. By Shruthi Kumar and Siddharth Patankar [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> oday,companies regularly generate terabytes of data in their daily operations. The sources include everything from data captured from network sensors, to the Web, social media, transactional business data, and data created in other busi- ness contexts. Given the volume of data being generated, real-time computation has become a major challenge faced by many organiza- tions. A scalable real-time computation system that we have used ef- fectively is the open-source Storm tool,which was developed atTwitter and is sometimes referred to as “real-time Hadoop.” However, Storm (http://storm-project.net/) is far simpler to use than Hadoop in that it does not require mastering an alternate universe of new technologies simply to handle big data jobs. This article explains how to use Storm. The example project, called “Speeding Alert System,” analyzes real-time data and raises a trigger and relevant data to a database, when the speed of a vehicle exceeds a predefined threshold. Storm Whereas Hadoop relies on batch processing, Storm is a real-time, dis- tributed, fault-tolerant, computation system. Like Hadoop, it can process huge amounts of data — but does so in real time — with guar- anteed reliability; that is, every message will be processed. Storm also offers features such as fault tolerance and distributed computation, which make it suitable for processing huge amounts of data on differ- ent machines.It has these features as well: • It has simple scalability. To scale, you simply add machines and change parallelism settings of the topology. Storm’s usage of T From the Vault
  • 24. Hadoop’s Zookeeper for cluster coordination makes it scalable for large cluster sizes. • It guarantees processing of every message. • Storm clusters are easy to manage. • Storm is fault tolerant:Once a topology is submitted,Storm runs the topology until it is killed or the cluster is shut down. Also, if there are faults during execution, reassignment of tasks is han- dled by Storm. • Topologies in Storm can be defined in any language, although typically Java is used. To follow the rest of the article, you first need to install and set up Storm.The steps are straightforward: • Download the Storm archive from the official Storm website (http://storm-project.net/downloads.html). • Unpack the bin/ directory onto your PATH and make sure the bin/storm script is executable. Storm Components A Storm cluster mainly consists of a master and worker node,with co- ordination done by Zookeeper. • Master Node: The master node runs a daemon,Nimbus,which is responsible for distributing the code around the cluster,assign- ing the tasks, and monitoring failures. It is similar to the Job Tracker in Hadoop. • Worker Node:The worker node runs a daemon,Supervisor,which listens to the work assigned and runs the worker process based on requirements.Each worker node executes a subset of a topol- ogy.The coordination between Nimbus and several supervisors is managed by a Zookeeper system or cluster. Zookeeper Zookeeper is responsible for maintaining the coordination service be- tween the supervisor and master.The logic for a real-time application is packaged into a Storm “topology.” A topology consists of a graph of spouts (data sources) and bolts (data operations) that are connected with stream groupings (coordination). Let’s look at these terms in greater depth. • Spout: In simple terms, a spout reads the data from a source for use in the topology.A spout can either be reliable or unreliable. A reliable spout makes sure to resend a tuple (which is an or- dered list of data items) if Storm fails to process it.An unreliable spout does not track the tuple once it’s emitted. The main method in a spout is nextTuple().This method either emits a new tuple to the topology or it returns if there is nothing to emit. • Bolt: A bolt is responsible for all the processing that happens in a topology. Bolts can do anything from filtering to joins, aggrega- tions, talking to files/databases, and so on. Bolts receive the data from a spout for processing,which may further emit tuples to an- other bolt in case of complex stream transformations.The main method in a bolt is execute(), which accepts a tuple as input. In both the spout and bolt,to emit the tuple to more than one stream, the streams can be declared and specified in declareStream(). • Stream Groupings: A stream grouping defines how a stream should be partitioned among the bolt’s tasks.There are built-in www.drdobbs.com [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 24
  • 25. www.drdobbs.com stream groupings (http://is.gd/eJvL0f) provided by Storm:shuf- fle grouping, fields grouping, all grouping, one grouping, direct grouping, and local/shuffle grouping. Custom implementation by using the CustomStreamGrouping interface can also be added. Implementation For our use case,we designed one topology of spout and bolt that can process a huge amount of data (log files) designed to trigger an alarm when a specific value crosses a predefined threshold. Using a Storm topology, the log file is read line by line and the topology is designed to monitor the incoming data.In terms of Storm components,the spout reads the incoming data. It not only reads the data from existing files, but it also monitors for new files. As soon as a file is modified, spout reads this new entry and,after converting it to tuples (a format that can be read by a bolt), emits the tuples to the bolt to perform threshold analysis,which finds any record that has exceeded the threshold. The next section explains the use case in detail. Threshold Analysis In this article,we will be mainly concentrating on two types of thresh- old analysis:instant threshold and time series threshold. • Instant threshold checks if the value of a field has exceeded the threshold value at that instant and raises a trigger if the condi- tion is satisfied. For example, it raises a trigger if the speed of a vehicle exceeds 80 km/h. • Time series threshold checks if the value of a field has exceeded the threshold value for a given time window and raises a trig- ger if the same is satisfied. For example, it raises a trigger if the speed of a vehicle exceeds 80 km/h more than once in last five minutes. Listing One shows a log file of the type we’ll use,which contains ve- hicle data information such as vehicle number,speed at which the ve- hicle is traveling,and location in which the information is captured. Listing One: A log file with entries of vehicles passing through the checkpoint. AB 123, 60, North city BC 123, 70, South city CD 234, 40, South city DE 123, 40, East city EF 123, 90, South city GH 123, 50, West city A corresponding XML file is created, which consists of the schema for the incoming data. It is used for parsing the log file. The schema XML and its corresponding description are shown in the Table 1. [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 25 Table 1.
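The two checks are simple enough to sketch independently of Storm. The Python below is not the article's ThresholdCalculatorBolt (that Java code appears later); it only illustrates the logic, using the 80 km/h limit and the five-minute window from the examples above. The hit-count limit and the timestamps are invented.

import time

SPEED_LIMIT = 80          # km/h
WINDOW = 5 * 60           # seconds
MAX_HITS_IN_WINDOW = 1    # "more than once in the last five minutes"

recent_hits = {}          # vehicle -> timestamps of readings over the limit

def check(vehicle, speed, now=None):
    now = now if now is not None else time.time()
    alerts = []
    if speed > SPEED_LIMIT:                      # instant threshold
        alerts.append("instant")
        hits = [t for t in recent_hits.get(vehicle, []) if now - t <= WINDOW]
        hits.append(now)
        recent_hits[vehicle] = hits
        if len(hits) > MAX_HITS_IN_WINDOW:       # time-series threshold
            alerts.append("time-window")
    return alerts

print(check("AB 123", 90, now=0))      # ['instant']
print(check("AB 123", 95, now=120))    # ['instant', 'time-window']
print(check("AB 123", 60, now=130))    # []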
  • 26. www.drdobbs.com The XML file and the log file are in a directory that is monitored by the spout constantly for real-time changes. The topology we use for this example is shown in Figure 1. As shown in Figure 1,the FileListenerSpout accepts the input log file, reads the data line by line, and emits the data to the Thresold- CalculatorBolt for further threshold processing.Once the process- ing is done, the contents of the line for which the threshold is calcu- lated is emitted to the DBWriterBolt, where it is persisted in the database (or an alert is raised). The detailed implementation for this process is explained next. Spout Implementation Spout takes a log file and the XML descriptor file as the input.The XML file consists of the schema corresponding to the log file.Let us consider an example log file,which has vehicle data information such as vehicle number,speed at which the vehicle is travelling,and location in which the information is captured.(See Figure 2.) Listing Two shows the specific XML file for a tuple, which specifies the fields and the delimiter separating the fields in a log file. Both the XML file and the data are kept in a directory whose path is specified in the spout. Listing Two: An XML file created for describing the log file. <TUPLEINFO> <FIELDLIST> <FIELD> <COLUMNNAME>vehicle_number</COLUMNNAME> <COLUMNTYPE>string</COLUMNTYPE> </FIELD> <FIELD> <COLUMNNAME>speed</COLUMNNAME> <COLUMNTYPE>int</COLUMNTYPE> </FIELD> <FIELD> <COLUMNNAME>location</COLUMNNAME> <COLUMNTYPE>string</COLUMNTYPE> </FIELD> </FIELDLIST> <DELIMITER>,</DELIMITER> </TUPLEINFO> [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 26 Figure 1:Topology created in Storm to process real-time data. Figure 2:Flow of data from log files to Spout.
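To show what the descriptor in Listing Two actually drives, here is a small Python sketch that reads the field list and delimiter from the XML and uses them to turn one line of the Listing One log file into named, typed values. The spout does the equivalent in Java, as described next.

import xml.etree.ElementTree as ET

DESCRIPTOR = """<TUPLEINFO>
  <FIELDLIST>
    <FIELD><COLUMNNAME>vehicle_number</COLUMNNAME><COLUMNTYPE>string</COLUMNTYPE></FIELD>
    <FIELD><COLUMNNAME>speed</COLUMNNAME><COLUMNTYPE>int</COLUMNTYPE></FIELD>
    <FIELD><COLUMNNAME>location</COLUMNNAME><COLUMNTYPE>string</COLUMNTYPE></FIELD>
  </FIELDLIST>
  <DELIMITER>,</DELIMITER>
</TUPLEINFO>"""

# Read the schema: an ordered list of (name, type) pairs plus the delimiter.
root = ET.fromstring(DESCRIPTOR)
fields = [(f.findtext("COLUMNNAME"), f.findtext("COLUMNTYPE"))
          for f in root.find("FIELDLIST")]
delimiter = root.findtext("DELIMITER")

# Split one line from Listing One and convert values according to the schema.
line = "EF 123, 90, South city"
values = [v.strip() for v in line.split(delimiter)]
record = {name: (int(v) if ftype == "int" else v)
          for (name, ftype), v in zip(fields, values)}
print(record)   # {'vehicle_number': 'EF 123', 'speed': 90, 'location': 'South city'}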
  • 27. www.drdobbs.com An instance of spout is initialized with constructor parameters of Di- rectory, Path, and TupleInfo object.The TupleInfo object stores necessary information related to log file such as fields, delimiter, and type of field. This object is created by serializing the XML file using XStream (http://xstream.codehaus.org/). Spout implementation steps are: • Listen to changes on individual log files. Monitor the directory for the addition of new log files. • Convert rows read by the spout to tuples after declaring fields for them. • Declare the grouping between spout and bolt,deciding the way in which tuples are given to bolt. The code for spout is shown in Listing Three. Listing Three: Logic in Open, nextTuple, and declareOutputFields methods of spout. public void open( Map conf, TopologyContext context,SpoutOutputCollector collector ) { _collector = collector; try { fileReader = new BufferedReader(new FileReader(new File(file))); } catch (FileNotFoundException e) { System.exit(1); } } public void nextTuple() { protected void ListenFile(File file) { Utils.sleep(2000); RandomAccessFile access = null; String line = null; try { while ((line = access.readLine()) != null) { if (line !=null) { String[] fields=null; if (tupleInfo.getDelimiter().equals(“|”)) fields = line.split (“”+tupleInfo.getDelimiter()); else fields = line.split(tupleInfo.getDelimiter()); if (tupleInfo.getFieldList().size() == fields.length) _collector.emit(new Values(fields)); } } } catch (IOException ex) { } } } public void declareOutputFields(OutputFieldsDeclarer declarer) { String[] fieldsArr = new String [tupleInfo.getFieldList().size()]; for(int i=0; i<tupleInfo.getFieldList().size(); i++) { fieldsArr[i] = tupleInfo.getFieldList().get(i).getColumnName(); } declarer.declare(new Fields(fieldsArr)); } declareOutputFields() decides the format in which the tuple is emitted, so that the bolt can decode the tuple in a similar fashion. [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 27
  • 28. www.drdobbs.com Spout keeps on listening to the data added to the log file and as soon as data is added,it reads and emits the data to the bolt for processing. Bolt Implementation The output of spout is given to bolt for further processing.The topol- ogy we have considered for our use case consists of two bolts as shown in Figure 3. ThresholdCalculatorBolt The tuples emitted by spout are received by the ThresholdCalcu- latorBolt for threshold processing. It accepts several inputs for threshold check.The inputs it accepts are: • Threshold value to check • Threshold column number to check • Threshold column data type • Threshold check operator • Threshold frequency of occurrence • Threshold time window A class,shown Listing Four,is defined to hold these values. Listing Four: ThresholdInfo class. public class ThresholdInfo implements Serializable { private String action; private String rule; private Object thresholdValue; private int thresholdColNumber; private Integer timeWindow; private int frequencyOfOccurence; } Based on the values provided in fields, the threshold check is made in the execute() method as shown in Listing Five. The code mostly consists of parsing and checking the incoming values. Listing Five: Code for threshold check. public void execute(Tuple tuple, BasicOutputCollector collector) { if(tuple!=null) { List<Object> inputTupleList = (List<Object>) tuple.getValues(); int thresholdColNum = thresholdInfo.getThresholdColNumber(); Object thresholdValue = thresholdInfo.getThresholdValue(); String thresholdDataType = tupleInfo.getFieldList().get(thresholdColNum-1) [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 28 Figure 3:Flow of data from Spout to Bolt.
  • 29. www.drdobbs.com .getColumnType(); Integer timeWindow = thresholdInfo.getTimeWindow(); int frequency = thresholdInfo.getFrequencyOfOccurence(); if(thresholdDataType.equalsIgnoreCase(“string”)) { String valueToCheck = inputTupleList.get(thresholdColNum-1).toString(); String frequencyChkOp = thresholdInfo.getAction(); if(timeWindow!=null) { long curTime = System.currentTimeMillis(); long diffInMinutes = (curTime-startTime)/(1000); if(diffInMinutes>=timeWindow) { if(frequencyChkOp.equals(“==”)) { if(valueToCheck.equalsIgnoreCase (thresholdValue.toString())) { count.incrementAndGet(); if(count.get() > frequency) splitAndEmit (inputTupleList,collector); } } else if(frequencyChkOp.equals(“!=”)) { if(!valueToCheck.equalsIgnoreCase (thresholdValue.toString())) { count.incrementAndGet(); if(count.get() > frequency) splitAndEmit(inputTupleList, collector); } } else System.out.println(“Operator not supported”); } } else { if(frequencyChkOp.equals(“==”)) { if(valueToCheck.equalsIgnoreCase (thresholdValue.toString())) { count.incrementAndGet(); if(count.get() > frequency) splitAndEmit(inputTupleList,collector); } } else if(frequencyChkOp.equals(“!=”)) { if(!valueToCheck.equalsIgnoreCase (thresholdValue.toString())) { count.incrementAndGet(); if(count.get() > frequency) splitAndEmit(inputTupleList,collector); } } } } else if(thresholdDataType.equalsIgnoreCase(“int”) || thresholdDataType.equalsIgnoreCase(“double”) || thresholdDataType.equalsIgnoreCase(“float”) || thresholdDataType.equalsIgnoreCase(“long”) || thresholdDataType.equalsIgnoreCase(“short”)) { String frequencyChkOp = thresholdInfo.getAction(); if(timeWindow!=null) { long valueToCheck = Long.parseLong(inputTupleList. get(thresholdColNum-1).toString()); long curTime = System.currentTimeMillis(); long diffInMinutes = (curTime-startTime)/(1000); System.out.println(“Difference in minutes=”+diffInMinutes); if(diffInMinutes>=timeWindow) [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 29
  • 30. www.drdobbs.com { if(frequencyChkOp.equals(“<”)) { if(valueToCheck < Double.parseDouble (thresholdValue.toString())) { count.incrementAndGet(); if(count.get() > frequency) splitAndEmit(inputTupleList,collector); } } else if(frequencyChkOp.equals(“>”)) { if(valueToCheck > Double.parseDouble (thresholdValue.toString())) { count.incrementAndGet(); if(count.get() > frequency) splitAndEmit(inputTupleList,collector); } } else if(frequencyChkOp.equals(“==”)) { if(valueToCheck == Double.parseDouble (thresholdValue.toString())) { count.incrementAndGet(); if(count.get() > frequency) splitAndEmit(inputTupleList,collector); } } else if(frequencyChkOp.equals(“!=”)) { . . . } } } else splitAndEmit(null,collector); } else { System.err.println(“Emitting null in bolt”); splitAndEmit(null,collector); } } The tuples emitted by the threshold bolt are passed to the next cor- responding bolt,which is the DBWriterBolt bolt in our case. DBWriterBolt The processed tuple has to be persisted for raising a trigger or for fur- ther use.DBWriterBolt does the job of persisting the tuples into the database.The creation of a table is done in prepare(), which is the first method invoked by the topology. Code for this method is given in Listing Six. Listing Six: Code for creation of tables. public void prepare( Map StormConf, TopologyContext context ) { try { Class.forName(dbClass); } catch (ClassNotFoundException e) { System.out.println(“Driver not found”); e.printStackTrace(); } try { connection driverManager.getConnection( “jdbc:mysql://”+databaseIP+”: ”+databasePort+”/”+databaseName, userName, pwd); connection.prepareStatement (“DROP TABLE IF EXISTS “+tableName).execute(); [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 30
  • 31. www.drdobbs.com StringBuilder createQuery = new StringBuilder( “CREATE TABLE IF NOT EXISTS “+tableName+”(“); for(Field fields : tupleInfo.getFieldList()) { if(fields.getColumnType().equalsIgnoreCase(“String”)) createQuery.append(fields.getColumnName()+” VARCHAR(500),”); else createQuery.append(fields.getColumnName()+” “+fields.getColumnType()+”,”); } createQuery.append(“thresholdTimeStamp timestamp)”); connection.prepareStatement(createQuery.toString()).execute(); // Insert Query StringBuilder insertQuery = new StringBuilder(“INSERT INTO “+tableName+”(“); String tempCreateQuery = new String(); for(Field fields : tupleInfo.getFieldList()) { insertQuery.append(fields.getColumnName()+”,”); } insertQuery.append(“thresholdTimeStamp”).append(“) values (“); for(Field fields : tupleInfo.getFieldList()) { insertQuery.append(“?,”); } insertQuery.append(“?)”); prepStatement = connection.prepareStatement(insertQuery.toString()); } catch (SQLException e) { e.printStackTrace(); } } Insertion of data is done in batches.The logic for insertion is provided in execute() as shown in Listing Seven,and consists mostly of pars- ing the variety of different possible input types. Listing Seven: Code for insertion of data. public void execute(Tuple tuple, BasicOutputCollector collector) { batchExecuted=false; if(tuple!=null) { List<Object> inputTupleList = (List<Object>) tuple.getValues(); int dbIndex=0; for(int i=0;i<tupleInfo.getFieldList().size();i++) { Field field = tupleInfo.getFieldList().get(i); try { dbIndex = i+1; if(field.getColumnType().equalsIgnoreCase(“String”)) prepStatement.setString(dbIndex, inputTupleList.get(i).toString()); else if(field.getColumnType().equalsIgnoreCase(“int”)) prepStatement.setInt(dbIndex, Integer.parseInt(inputTupleList.get(i).toString())); else if(field.getColumnType().equalsIgnoreCase(“long”)) prepStatement.setLong(dbIndex, Long.parseLong(inputTupleList.get(i).toString())); else if(field.getColumnType().equalsIgnoreCase(“float”)) prepStatement.setFloat(dbIndex, Float.parseFloat(inputTupleList.get(i).toString())); else if(field.getColumnType(). equalsIgnoreCase(“double”)) prepStatement.setDouble(dbIndex, Double.parseDouble(inputTupleList.get(i).toString())); else if(field.getColumnType().equalsIgnoreCase(“short”)) prepStatement.setShort(dbIndex, Short.parseShort(inputTupleList.get(i).toString())); else if(field.getColumnType(). equalsIgnoreCase(“boolean”)) prepStatement.setBoolean(dbIndex, Boolean.parseBoolean(inputTupleList.get(i).toString())); [STORM] INTHISISSUE GuestEditorial>> News >> Open-SourceDashboard>> WhatBigDataCanDeliver>> Lambda>> Storm>> Links>> TableofContents >> November 2013 31
                else if(field.getColumnType().equalsIgnoreCase("byte"))
                    prepStatement.setByte(dbIndex,
                        Byte.parseByte(inputTupleList.get(i).toString()));
                else if(field.getColumnType().equalsIgnoreCase("Date"))
                {
                    Date dateToAdd = null;
                    if (!(inputTupleList.get(i) instanceof Date))
                    {
                        DateFormat df = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
                        try
                        {
                            dateToAdd = df.parse(inputTupleList.get(i).toString());
                        }
                        catch (ParseException e)
                        {
                            System.err.println("Data type not valid");
                        }
                    }
                    else
                        dateToAdd = (Date) inputTupleList.get(i);
                    if (dateToAdd != null)
                    {
                        java.sql.Date sqlDate = new java.sql.Date(dateToAdd.getTime());
                        prepStatement.setDate(dbIndex, sqlDate);
                    }
                }
            }
            catch (SQLException e)
            {
                e.printStackTrace();
            }
        }
        Date now = new Date();
        try
        {
            prepStatement.setTimestamp(dbIndex + 1,
                new java.sql.Timestamp(now.getTime()));
            prepStatement.addBatch();
            counter.incrementAndGet();
            if (counter.get() == batchSize)
                executeBatch();
        }
        catch (SQLException e1)
        {
            e1.printStackTrace();
        }
    }
    else
    {
        long curTime = System.currentTimeMillis();
        long diffInSeconds = (curTime - startTime) / 1000; // milliseconds to seconds
        if(counter.get() < batchSize && diffInSeconds > batchTimeWindowInSeconds)
        {
            try
            {
                executeBatch();
                startTime = System.currentTimeMillis();
            }
            catch (SQLException e)
            {
                e.printStackTrace();
            }
        }
    }
}

public void executeBatch() throws SQLException
{
    batchExecuted = true;
    prepStatement.executeBatch();
    counter = new AtomicInteger(0);
}
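Listings Six and Seven both rely on a tupleInfo object and a Field type that describe the column name and column type of each value in the incoming tuple; neither is shown in this excerpt. The following is only a minimal sketch of what such a holder might look like — the method names match those called in the listings, but everything else is an assumption rather than the authors' implementation.

import java.util.ArrayList;
import java.util.List;

// Hypothetical schema holder: Listings Six and Seven only imply
// getFieldList(), getColumnName(), and getColumnType().
class Field
{
    private final String columnName;
    private final String columnType; // e.g., "String", "int", "double", "Date"

    Field(String columnName, String columnType)
    {
        this.columnName = columnName;
        this.columnType = columnType;
    }

    public String getColumnName() { return columnName; }
    public String getColumnType() { return columnType; }
}

class TupleInfo
{
    private final List<Field> fieldList = new ArrayList<Field>();

    public void addField(Field field) { fieldList.add(field); }
    public List<Field> getFieldList() { return fieldList; }
}

A tupleInfo populated with one Field per column of the input is all DBWriterBolt needs to build the CREATE TABLE and INSERT statements in prepare() and to pick the correct PreparedStatement setter for each value in execute().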
Once the spout and bolts are ready, a topology is assembled with the topology builder and executed. The next section explains the execution steps.

Running and Testing the Topology in a Local Cluster
• Define the topology using TopologyBuilder, which exposes the Java API for specifying a topology for Storm to execute.
• Using StormSubmitter, submit the topology to the cluster. It takes the name of the topology, the configuration, and the topology itself as input.
• Listing Eight shows both paths: submitting to a cluster when command-line arguments are supplied, and to a local cluster otherwise.

Listing Eight: Building and executing a topology.

public class StormMain
{
    public static void main(String[] args) throws AlreadyAliveException,
        InvalidTopologyException, InterruptedException
    {
        ParallelFileSpout parallelFileSpout = new ParallelFileSpout();
        ThresholdBolt thresholdBolt = new ThresholdBolt();
        DBWriterBolt dbWriterBolt = new DBWriterBolt();
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", parallelFileSpout, 1);
        builder.setBolt("thresholdBolt", thresholdBolt, 1).
            shuffleGrouping("spout");
        builder.setBolt("dbWriterBolt", dbWriterBolt, 1).
            shuffleGrouping("thresholdBolt");

        Config conf = new Config();
        if(args != null && args.length > 0)
        {
            conf.setNumWorkers(1);
            StormSubmitter.submitTopology(
                args[0], conf, builder.createTopology());
        }
        else
        {
            conf.setDebug(true);
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology(
                "Threshold_Test", conf, builder.createTopology());
        }
    }
}

After building the topology, it is submitted to the local cluster. Once submitted, the topology runs until it is explicitly killed or the cluster is shut down, without requiring any further intervention. This is another big advantage of Storm.

This comparatively simple example shows the ease with which it's possible to set up and use Storm once you understand the basic concepts of topology, spout, and bolt. The code is straightforward, and both scalability and speed are provided by Storm. So, if you're looking to handle big data and don't want to traverse the Hadoop universe, you might well find that using Storm is a simple and elegant solution.

—Shruthi Kumar works as a technology analyst and Siddharth Patankar is a software engineer with the Cloud Center of Excellence at Infosys Labs.
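A practical note on the local-mode branch of Listing Eight: LocalCluster keeps the topology running inside the JVM until the process is stopped. When the example is driven from a test, it is common to let it run for a fixed window and then shut the cluster down explicitly. The lines below are a minimal sketch of that pattern against the backtype.storm API used in this article; they would follow the cluster.submitTopology() call in the else branch, and the 60-second window is an arbitrary choice, not something from the original code.

// Sketch: add after cluster.submitTopology(...) in the local-mode branch.
// Requires: import backtype.storm.utils.Utils;
Utils.sleep(60 * 1000);                 // let the topology process the input for a while
cluster.killTopology("Threshold_Test"); // stop the topology submitted above
cluster.shutdown();                     // tear down the in-process cluster

To run the same main class against a real Storm cluster instead, the packaged jar is submitted with Storm's command-line client, for example storm jar threshold-topology.jar StormMain Threshold_Test (the jar name here is hypothetical); the presence of the argument then routes execution through the StormSubmitter branch.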
This Month on DrDobbs.com
Items of special interest posted on www.drdobbs.com over the past month that you may have missed.

IF JAVA IS DYING, IT SURE LOOKS AWFULLY HEALTHY
The odd, but popular, assertion that Java is dying can be made only in spite of the evidence, not because of it.
http://www.drdobbs.com/240162390

CONTINUOUS DELIVERY: THE FIRST STEPS
Continuous delivery integrates many practices that in their totality might seem daunting. But starting with a few basic steps brings immediate benefits. Here's how.
http://www.drdobbs.com/240161356

A SIMPLE, IMMUTABLE, NODE-BASED DATA STRUCTURE
Array-like data structures aren't terribly useful in a world that doesn't allow data to change, because it's hard to implement even such simple operations as appending to an array efficiently. The difficulty is that in an environment with immutable data, you can't just append a value to an array; you have to create a new array that contains the old array along with the value you want to append.
http://www.drdobbs.com/240162122

DIJKSTRA'S 3 RULES FOR PROJECT SELECTION
Want to start a unique and truly useful open-source project? These three guidelines on choosing wisely will get you there.
http://www.drdobbs.com/240161615

PRIMITIVE VERILOG
Verilog is decidedly schizophrenic. There is a part of the Verilog language that synthesizers can commonly convert into FPGA logic, and then there is an entire part of the language that doesn't synthesize.
http://www.drdobbs.com/240162355

DEVELOPING ANDROID APPS WITH SCALA AND SCALOID: PART 2
Starting with templates, Android features can be added quickly with a single line of DSL code.
http://www.drdobbs.com/240162204

FIDGETY USB
Linux-based boards like the Raspberry Pi or the BeagleBone usually have some general-purpose I/O capability, but it is easy to forget they also sport USB ports.
http://www.drdobbs.com/240162050
DR. DOBB'S / UBM TECH EDITORIAL
Andrew Binstock, Editor in Chief, Dr. Dobb's, andrew.binstock@ubm.com
Deirdre Blake, Managing Editor, Dr. Dobb's, deirdre.blake@ubm.com
Amy Stephens, Copyeditor, Dr. Dobb's, amy.stephens@ubm.com
Jon Erickson, Editor in Chief Emeritus, Dr. Dobb's
Contributing Editors: Scott Ambler, Mike Riley, Herb Sutter

Editorial offices: 303 Second Street, Suite 900, South Tower, San Francisco, CA 94107, 1-415-947-6000; 751 Laurel Street #614, San Carlos, CA 94070 USA

Entire contents Copyright © 2013, UBM Tech/United Business Media LLC, except where otherwise noted. No portion of this publication may be reproduced, stored, or transmitted in any form, including computer retrieval, without written permission from the publisher. All Rights Reserved. Articles express the opinion of the author and are not necessarily the opinion of the publisher. Published by UBM Tech/United Business Media, 303 Second Street, Suite 900 South Tower, San Francisco, CA 94107 USA, 415-947-6000.