Table of Contents
1. Background
2. Problem Statement
3. Proposed Architecture—High-Level
   a. 100GB-scale Data Volume
   b. Log Files as Data Source
   c. Customer-facing OLAP
4. Proposed Architecture—Low-Level
   a. Hadoop
   b. Data Marts
      i. One vs. Many
      ii. Brand of RDBMS
   c. Reporting Portal
   d. Hardware
   e. Java Programming
5. Data Anomaly Detection
6. Data integration/importation and Data Quality Management
7. Summary
Appendix A. Hadoop Overview
   MapReduce
      Map
      Reduce
   Hadoop Distributed File System (HDFS)
8. Query Optimization
9. Access and Data Security
10. Internal Management and Collaboration tools
11. Salesforce and Force.com integration
12. Roadmap
1. Background
<Company presentation and background – Confidential>
2. Problem Statement
In terms of database load, the number of sites is the best metric, since it describes the number .... It is
therefore very important that the web application remain effective as the company grows (this includes the
database, the framework, and the architecture of the servers). Also, as the company grows in ..., it will need
to deploy a server in Europe to manage ...
In addition, because historical data will be kept and the number of ... will grow, the data volume will grow
exponentially. The overall database architecture therefore needs to be highly and easily scalable.
It is also more than likely that, as the solution's price decreases, bigger corporations will be interested in the ...
solution. Therefore, the ... solution will need to be integrated into existing information systems.
This will require:
Interfacing the ... solution with existing applications.
Having the ... solution rely on standard and open technologies.
Building partnerships with System Integrators, or building an internal Professional Services organization
to support these customers.
With its current, somewhat limited database schema, the data warehouse’s millions of records consume
more than 2GB of disk space, including indexes. Extensions to the data warehouse schema, coupled with
a growing customer base, will easily push the data warehouse volume beyond 100GB. The single
instance, multi-schema MySQL database architecture simply does not provide the scalability necessary to
meet ... demands.
In addition to these scalability problems, the reporting infrastructure is also limited in its potential for
enhanced functionality. For instance, ... would like to extend the Reporting Portal to provide customers
with ad-hoc, multi-dimensional query capability and custom reporting based on searchable attribute tags in
the data warehouse. At present, the data warehouse dimensions do not provide the flexibility needed to
easily accommodate these kinds of changes.
Therefore, ... has a pressing need to replace its current reporting infrastructure with a scalable, flexible
architecture that can not only accommodate their growing data volumes, but also dramatically extend their
reporting functionality. Key goals for the new infrastructure include:
Redundant, efficient retention of historical detail
o Write once, read many
o Compression
o No encryption required
o ANSI-7 single-byte code page is sufficient
Linear scalability (i.e., as data volume increases, performance is not degraded)
Flexible extensibility (e.g., attributes can easily be added and exposed to customers for reporting,
either as dimensional attributes or fact attributes)
Full OLAP support
o Standard reports
o Custom reports
o Ad-hoc query
o Multi-dimensional
o Hierarchical categories (i.e., tagging, snowflakes)
o Charts and graphs
o Drill-down to atomic detail (i.e., ... log)
o 24x7 availability
o Query response time measured in seconds (not minutes)
Efficient ETL
o Near real time (i.e., < 15 minutes)
o Handles fluctuating volumes throughout the day without becoming a bottleneck (which can
cause synchronization problems in the data warehouse)
Partitioning of data by customer
This new architecture must deliver vastly improved functionality, while controlling for implementation cost
and time to roll-out.
3. Proposed Architecture—High-Level
From an architectural perspective, there are three overarching factors driving the technical solution for
... reporting needs:
a. 100GB-scale Data Volume
Due to their sheer size, large applications like ...'s data warehouse require more resources than
can typically be served by a single, cost-effective machine. Even if a large, expensive server
could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a
single machine could provide the continuous, uninterrupted operation needed to meet ... SLAs.
A cloud computing architecture, on the other hand, is an economical, scalable solution that
provides seamless fault tolerance for large data applications.
b. Log Files as Data Source
More and more organizations are seeking to leverage the rich content in their verbose log files
to drive business intelligence. Sourcing from log files presents a different set of challenges
compared to selecting data out of a highly structured OLTP database. Efficient, robust, and
flexible parsing routines must be programmed to identify tagged attributes and map these to
business constructs in the data warehouse. And because log files tend to consume lots of disk
space, they should ideally be stored in a distributed file system in order to load balance I/O and
improve fault tolerance.
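To illustrate that parsing work, the following is a minimal sketch in Java that extracts tagged attributes from a single log line. The key=value format, class name, and field names are assumptions for illustration, not the actual ... log format.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogLineParser {
        // Matches tagged attributes of the (assumed) form key=value.
        private static final Pattern TAG = Pattern.compile("(\\w+)=(\\S+)");

        // Returns a map of tag names to values found in one log line.
        public static Map<String, String> parse(String line) {
            Map<String, String> attributes = new HashMap<String, String>();
            Matcher m = TAG.matcher(line);
            while (m.find()) {
                attributes.put(m.group(1), m.group(2));
            }
            return attributes;
        }
    }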
c. Customer-facing OLAP
The stakes are usually higher when building and maintaining a customer-facing business
intelligence solution, as opposed to one that is implemented internally. ... reputation and
marketability depend in part on its customers’ opinions of the Reporting Portal. It must be
intuitive, easy to use, powerful, secure, and available anytime. Its data should be as fresh as
possible, while providing historical data for trend analyses. Customers should have seamless
access to both aggregated metrics and ... log detail. The Reporting Portal should expose the
customizability of the speech application through its reports. Any customer-specific categories,
tags, and data content should be faithfully reflected in the Reporting Portal, just as the customer
would expect to see them.
Based on these driving factors, we propose a cloud computing architecture comprising a distributed file
system, distributed file processing, one or more relational data marts, and a browser-based OLAP
package (see Figure 1). Most of this infrastructure will be built using open source software
technologies running on commodity hardware. This strategy keeps initial implementation costs low for
a right-sized solution, while providing a path for scalable growth.
Figure 1. High-Level Architecture

[Diagram: ... logs flow into the Hadoop Distributed File System (HDFS), then into relational data mart(s), and finally to the Reporting Portal. ... logs are retained forever (or as otherwise specified per customer requirements). ... logs are immediately replicated into HDFS and can be retained indefinitely. Any portion of historical data can be read from Hadoop and aggregated as needed into optimized reporting database(s). Reports, ad-hoc queries, graphs, and charts are presented via browser-based software.]
In this design, Apache Hadoop (http://hadoop.apache.org/) is used to perform some of the functions
normally provided by a relational data warehouse. Most specifically, Hadoop behaves as the system of
record, storing all of the historical detail generated by the Speech Applications. New ... logs are
immediately replicated into the Hadoop Distributed File System (HDFS), which is massively scalable to
accommodate virtually any amount of data. HDFS is based on Google’s GFS, which essentially stores
the content of the Web in order to facilitate index generation. Other well-known companies that store
huge volumes of data in HDFS include Yahoo!, AOL, Facebook, and Amazon. Hadoop is free to
download and install. It uses a cloud computing architecture (i.e., lots of inexpensive computers linked
together, sharing workload), so it can be easily and economically extended as needed to scale for
growth. Scaling performance is linear; performance does not degrade as you increase data volume.
Hadoop cannot fulfill all of the functions of a data warehouse, though. For instance, it does not contain
indexes like a relational database, so it can’t truly be optimized to return query results quickly. Hadoop
provides a very powerful, distributed job processing technology called MapReduce, which can perform
much of the extract and transform work that is commonly done by ETL tools. Therefore, Hadoop
powerfully augments ... business intelligence architecture by using distributed storage and processing
to perform the data warehousing functions that would otherwise be the hardest to scale under a
traditional, single-machine, relational data warehouse architecture.
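To make the ingest step concrete, here is a minimal Java sketch that copies a newly arrived log file into HDFS; the NameNode URI, target directory, and class name are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LogArchiver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // illustrative NameNode URI
            FileSystem hdfs = FileSystem.get(conf);
            Path localLog = new Path(args[0]);
            Path target = new Path("/logs/" + localLog.getName()); // illustrative layout
            // Copy the local log into HDFS (keep the source; overwrite any existing copy).
            hdfs.copyFromLocalFile(false, true, localLog, target);
        }
    }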
While Hadoop does the "heavy lifting," other, more traditional technologies are used to provide familiar
business intelligence functionality. Relational data marts serve up optimized OLAP database schemas
(e.g., highly indexed star schemas) for querying via standard business intelligence tools. One defining
factor of a data mart is that it can be completely truncated and reloaded from the upstream data
repository (in this case, Hadoop) as needed. This means that if ... needs to enhance the reporting
database design by altering a dimension or adding new metrics, the data mart's schema can be altered—
even dramatically—and repopulated without the risk of losing any historical data. It’s also worth noting
that because the Hadoop repository stores all historical detail, it is possible to retroactively back-
populate new metrics that are added to the data mart(s).
As of this writing, it is not known how much data volume must be accommodated in a given data mart.
And we don’t yet know whether one data mart would suffice, or if there would be many data marts.
These questions will influence the choice of relational database management system (RDBMS) that is
selected for .... For example, MySQL is cheap to procure and implement, but has serious scalability
limitations. A columnar MPP database like ParAccel is ideal for handling multi-terabyte data volumes,
but comes with a price tag. One advantage of this proposed architecture, though, is that the data marts
can be migrated from one technology to another without risk of losing valuable data.
The customer-facing front end technology should be a mature, fully-supported product like
BusinessObjects or MicroStrategy. Such technologies are rich with features that would otherwise be
very costly to develop in-house, even with open source Java libraries. Besides, the customers who use
this interface should not become quality assurance testers for internally developed user interfaces. The
Reporting Portal is a marketed service and as such, must leave customers with a great impression.
4. Proposed Architecture—Low-Level
This section provides an in-depth look at each component in Figure 1 above.
a. Hadoop
Hadoop is an extremely powerful open source technology that does certain things very well, like
store immense volumes of data and perform distributed computations on that data. Some of
these strengths can be leveraged within the context of a business intelligence application.
For instance, several of the functions that would normally be performed within a traditional data
warehouse could be taken up by Hadoop. One defining feature of a data warehouse is that it
stores historical data. While source systems may only keep a rolling window of recent data, the
data warehouse retains all or most of the history. This frees up the transactional systems to
efficiently run the business, while keeping a historical system of record in the data warehouse.
HDFS is ideal for archiving large volumes of static data, such as ... ... logs. HDFS provides
linear scalability as data volumes increase. Not only can HDFS easily handle ...'s forever
retention requirement, but it could also permit ... to retain all of its history. HDFS comfortably
scales into the petabyte range, so the need to age out and purge files could be eliminated
altogether.
Hadoop is a perfect solution for these historical retention problems, because it easily scales to petabyte
sizes simply by configuring additional hardware into the cluster.
Another benefit of HDFS is its data redundancy. HDFS replicates file blocks across nodes,
which can physically reside in the same data center or in another data center (assuming the
VPN bandwidth supports it). This would entirely eliminate the need for ... to copy zipped ... log
files between data centers (see Figure 2).
Figure 2. ... Log-Hadoop Architecture

[Diagram: A Java program reads each VXML ... log and writes it into HDFS (Figure A-2) for permanent storage; MapReduce (Figure A-1) processes the stored data. The Hadoop Distributed File System (HDFS) can be configured to transparently replicate data across racks and across data centers, providing redundant failover copies of all file blocks.]

Although business intelligence solutions depend on lots of data, business users are interested in
information. In order to transform large volumes of raw data into meaningful business metrics,
calculations must be performed, business rules must be applied, and large numbers of data elements
must be summarized into a few figures.
Traditionally, this type of aggregation work is done outside of the data warehouse by an extract,
transform, and load (ETL) tool, or within the data warehouse using stored procedures and materialized
views. Due to the inherent constraints imposed by a relational database system like MySQL, there are
limits to how much data can reasonably be aggregated this way. As source data volumes increase, the
time required to perform aggregations can extend beyond the point in time when the resulting metrics
are needed by the customers.
Hadoop is able to perform these kinds of aggregations much more quickly on large data volumes because it
distributes the processing across many computers, each one crunching the numbers for a subset of the
source data. Consequently, aggregated metrics that might have taken days to calculate in a traditional
data warehouse model can be churned out by Hadoop in a couple of hours or even minutes.
MapReduce is particularly well-suited to structured data sets like ... ... logs. Tagged attributes map
easily to key/value pairs, which are the transactional unit of MapReduce jobs (see Figure A-1 in the
appendix). ... ETL routines could therefore be replaced with Java MapReduce jobs that read ... log
files from HDFS and write to the data marts (see Figure 3 and the sketch below).
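A minimal sketch of such a job follows, assuming (purely for illustration) that each log line carries whitespace-separated tag=value attributes; the class names, the counted metric, and the input/output paths are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CallMetricsJob {

        // Map phase: emit (tag, 1) for every tagged attribute found in a log line.
        public static class TagMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String field : line.toString().split("\\s+")) {
                    int eq = field.indexOf('=');
                    if (eq > 0) {
                        ctx.write(new Text(field.substring(0, eq)), ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts for each tag.
        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text tag, Iterable<LongWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) {
                    total += c.get();
                }
                ctx.write(tag, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "call-metrics");
            job.setJarByClass(CallMetricsJob.class);
            job.setMapperClass(TagMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., the HDFS log directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g., an HDFS results directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }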
Figure 3. Hadoop MapReduce Architecture

[Diagram: Java programs execute MapReduce jobs (Figure A-1) to extract and transform any subset of ... log data stored in HDFS (Figure A-2), and then write the aggregated results into the relational data mart(s) via JDBC, where other tools can also consume them. The entire history of ... logs is permanently stored in Hadoop, making it possible to back-populate new metrics with old data, perform year-over-year trend reports, and manually mine data as needed.]

There are also quite a few maturing open source tools that can provide analysts direct access to
Hadoop data. For instance, a desktop tool like HBase or Hive can be used as a SQL-like interface into
Hadoop, permitting analysts to run queries in much the same way that they would access a traditional
data warehouse. These tools might be useful to ... personnel who want to perform analyses that are
not immediately available through the Reporting Portal. Such tools are best suited for more technically
literate analysts who are comfortable writing their own queries and do not require fast query response
time.
Cloudera (http://www.cloudera.com/) recently unveiled its browser-based Cloudera Desktop product.
This tool simplifies some of the work required to set up, execute, and monitor MapReduce jobs. For the
more technically inclined analysts in ... organization, Cloudera Desktop might be a good fit—even better
than one of the SQL emulators like HBase. Cloudera Desktop’s main features include:
File Browser – Navigate the Hadoop file system
Job Browser – Examine MapReduce job states
Job Designer – Create MapReduce job designs
Cluster Health – At-a-glance state of the Hadoop cluster
It is also possible to use Hadoop's MapReduce to generate "canned reports" in batch processing mode.
That is, nightly batch jobs can be scheduled to produce static reports. These reports would consume
data directly from Hadoop, and the resulting content could be pre-formatted for presentation via HTML.
Such reports would effectively by-pass the relational data mart altogether.
b. Data Marts
Stated simply, Hadoop can make an excellent contribution as a component of a business
intelligence solution, but it cannot be the whole solution. A key limitation is that a data
warehouse is indexed to provide fast query response time, while Hadoop data is not. A data
warehouse (or data mart) typically contains pre-aggregated metrics in order to deliver selected
results as fast as possible (i.e., without re-aggregating on the fly). Therefore, a gating factor in
deciding whether to run analytic queries and reports against Hadoop is the end user’s
expectation for response time. Since ... customers expect and deserve immediate to near-
immediate query performance, directly querying Hadoop is not a viable design for the Reporting
Portal.
It’s also worth noting here that most of the mature, industry-standard OLAP tools like
BusinessObjects and MicroStrategy cannot be coupled directly with Hadoop. Therefore, the ...
reporting infrastructure will still require a traditional, relational, indexed data store containing pre-
aggregated metrics.
This data store is rightly called a data mart, because it is not the historical repository of detailed
data, or system of record. All of its content can be regenerated at any time from the upstream
data source.
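Because the marts are loaded via JDBC (see Figure 3), the hand-off can be as simple as the sketch below; the connection URL, credentials, and fact table name are illustrative assumptions, not ... specifics.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Map;

    public class MartLoader {
        // Batch-inserts aggregated metrics (e.g., MapReduce output) into a mart fact table.
        public static void load(Map<String, Long> metrics) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://mart-host/reporting", "etl_user", "secret"); // illustrative
            try {
                conn.setAutoCommit(false);
                PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO fact_call_metrics (tag, total) VALUES (?, ?)"); // hypothetical table
                for (Map.Entry<String, Long> e : metrics.entrySet()) {
                    ps.setString(1, e.getKey());
                    ps.setLong(2, e.getValue());
                    ps.addBatch();
                }
                ps.executeBatch();
                ps.close();
                conn.commit();
            } finally {
                conn.close();
            }
        }
    }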
... has two basic architectural decisions to make with regard to the data mart. First is whether to
create one data mart or multiple data marts. The second decision is which brand of RDBMS to
implement.
i. One vs. Many
There are a couple of compelling reasons to implement multiple, separate data marts.
One reason is performance. The less data you cram into a relational database, the
faster it generally performs. There can be exceptions to this rule (like ParAccel’s Analytic
Database), but relational databases are usually more responsive with smaller data
volumes.
A second motivation for splitting ... data into multiple marts is security. It’s certainly quite
possible to implement robust security within a single relational database instance, but
physically separating each customer’s data definitely ensures that they cannot see one
another’s content. However, it is strongly recommended that ... not rely solely on
physical separation to enforce data security. There might be situations in which it is not
economical to store lots of small customers' data separately. ... should retain the option to
co-mingle multiple customers’ data in one database instance, while ensuring privacy to
each of them.
Figure 4. Multiple Data Marts

[Diagram: the system of record, which contains all historical detail, feeds separate relational data marts for Customer A, Customer B, Customer C, ...]
A third reason for implementing multiple data marts is customizability. It’s quite possible
that Customer A might require different kinds of metrics from what Customer B needs.
One data mart would have to be all things to all customers, making it horribly complex.
The turnaround time required to add customer-specific metrics would be greatly
improved by hosting them in a dedicated data mart.
Having multiple data marts would be very similar to ... current reporting architecture,
which uses dedicated MySQL schemas to partition customer data.
ii. Brand of RDBMS
There are several factors influencing ... choice of relational database management
system. The primary factor will likely be data volume, which itself is influenced by many
factors (e.g., data model, historical timeframe, individual customer’s ... log volume).
Therefore, within the context of this proposal, it is not possible to accurately estimate
data sizing. Instead, we can provide some basic guidance for future reference.
From our experience, relatively small volumes (i.e., 10s of GB or less) can be
comfortably accommodated by MySQL. Medium volumes (up to 100s of GB) are better
served by Microsoft SQL Server or Oracle. Large volumes (100s of GB to TB-scale)
require a columnar MPP database like ParAccel Analytical Database, Netezza, Teradata,
Exadata, or Vertica.
In addition to data volumes, ... will likely consider cost. MySQL is free, while other
products can cost hundreds of thousands of dollars to purchase. The cost of a given
RDBMS may also depend in part on the hardware needed to support it. Some RDBMS
products only run on certain brands of hardware. Clearly, this can have far-reaching
ramifications for ... costs of operations. We recommend that ... choose database
software that can run on any Intel-powered, rackable server. Such hardware will provide
the most economical scalability path.
Table 1. RDBMS Recommendations

Data Volume        Brand                        Notes
Up to 10s of GB    MySQL                        Free, but doesn't scale well
Up to 100s of GB   Microsoft SQL Server         Good value for money, easy to run on commodity hardware
100s of GB to TB   ParAccel Analytic Database   Powerful, hardware-flexible, negotiable pricing model
c. Reporting Portal
... next generation Reporting Portal could provide its customers with a greatly expanded set of
features if it is replaced with an industry-standard business intelligence tool like BusinessObjects
or MicroStrategy.
The choice of such a tool will essentially be driven by how ... customers' needs change and, more
importantly, by whether ... starts to take on bigger corporations with existing IT architectures as clients.
In the short and medium term, an open source tool such as DataVision
(http://datavision.sourceforge.net) would be a perfect solution, allowing custom reports to be produced
easily and results to be generated in XML format.
The XML format makes report distribution almost operating-system agnostic; the only requirement is
the ability to read XML files on the platform where the reports need to be viewed.
These web-based tools leverage the power of metadata to enforce security and map business
metrics to back-end data structures. A metadata-based tool flexibly supports business
abstractions like categories and hierarchies that are not inherent to the physical data. Business
intelligence tools offer a rich presentation layer capable of displaying the graphs, charts, and
pivot tables that business users have come to expect from reporting interfaces.
Figure 5. Browser-based Front-end

[Diagram: customers' browsers connect over the Internet to a BI web server inside the ... network. The BI web server draws on the relational data marts and a BI metadata repository. The vendor-supported business intelligence application provides a richly featured, web-based interface: customers can run standard and custom reports, issue ad-hoc queries, generate charts and graphs, save results to Excel, etc.]
By leveraging a mature front-end technology, ... gains the advantage of reducing its internal
Java development effort, while giving its customers a greatly expanded set of reporting and
OLAP functionality. There are many products on the market, some cheaper and less mature than
the long-standing industry leaders, BusinessObjects XI 3.1 and MicroStrategy 9. Our
recommendation to ... is to be willing to invest in this customer-facing component so that it
leaves the most appealing impression on its end users.
d. Hardware
All of the technologies outlined thus far will run quite well on the type of hardware that ...
currently uses to serve the Reporting Portal’s data warehouse. ... could purchase several more
of the rackable Dell PowerEdge 2950 server trays running Windows Server 2003 and array
them as a Hadoop cluster, data mart hosts, or web servers. Operational considerations like
data center space and power notwithstanding, this hardware choice would preserve ... current
SOE (standard operating environment), and minimize retraining of operations staff.
e. Java Programming
One reason that the Hadoop technology was selected is the high degree of skill and experience
that ... personnel have with Java programming. As discussed earlier, interfaces into and out of
Hadoop will most likely be coded in Java. These interfaces would likely be designed,
developed, tested, and supported by ... personnel. At first blush, this statement might raise
concerns about the cost of hand-coding data interfaces, versus buying a vendor-supported
product. However, there are currently no data integration products available on the market to
perform these tasks. Furthermore, even if an off-the-shelf data integration (ETL) tool like Informatica
PowerCenter could be purchased, it would still require expensive consulting services to
implement and support. Net net, programming these interfaces in Java is actually a very logical
choice for ....
5. Data Anomaly Detection
In addition, thanks to its extensive analytics capabilities and performance, Hadoop makes it possible to
run different kinds of deep analysis to define data anomaly patterns, then detect and report them within
minutes.
You'll find attached several documents describing different anomaly detection approaches. In addition,
there is a lot of information available on the Hadoop Wiki, such as
http://wiki.apache.org/hadoop/Anomaly_Detection_Framework_with_Chukwa, which describes the
Chukwa framework for detecting anomalies.
6. Data integration/importation and Data Quality Management
As an alternative to using Hadoop's ETL features, Cloudera (the open source vendor of Hadoop) and
Talend (an open source ETL tool: Extract, Transform, and Load) recently announced a technology
partnership:
http://www.cloudera.com/company/press-center/releases/talend_and_cloudera_announce_technology_partnership_to_simplify_processing_of_large_scale_data.
Talend is the recognized market leader in open source data management.
Talend's solutions and services help minimize the costs and maximize the value of data integration,
ETL, data quality, and master data management.
We highly recommend using Talend as the dedicated tool for data integration, ETL, and data quality.
7. Summary
Based on key factors like terabyte-scale data volumes, log files as data source, and customer-facing
OLAP, the optimal architecture for ... Reporting Portal infrastructure comprises a cloud computing
model with distributed file storage; distributed processing; optimized, relational data marts; and an
industry-leading, web-based, metadata-driven business intelligence package. The cloud computing
architecture affords ... virtually unlimited, linear scalability that can grow economically with demand.
Relational data marts ensure excellent query performance and low-risk flexibility for adding metrics,
changing reporting hierarchies, etc.
Appendix A. Hadoop Overview
Due to their sheer size, large applications like ...'s data warehouse require more resources than can
typically be served by a single, cost-effective machine. Even if a large, expensive server could be
configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine
could provide the continuous, uninterrupted operation needed by today’s full-time applications. The
Hadoop open-source framework—or Hadoop Common, as it is now officially known—is a Java cloud
computing architecture designed as an economical, scalable solution that provides seamless fault
tolerance for large data applications.
Hadoop is a top-level Apache Software Foundation project, being built and used by a community of
contributors from all over the world. As such, Hadoop is not a vendor-supported software package. It is a
development framework that requires in-depth programming skills to implement and maintain. Therefore,
an organization that chooses to deploy Hadoop will need to employ skilled personnel to maintain the
cluster, program MapReduce jobs, and develop input/output interfaces.
Hadoop Common runs applications on large, high-availability clusters of commodity hardware. It
implements a computational paradigm named MapReduce, where the application is divided into many
small fragments of work, each of which may be executed on any node in the cluster. In addition, Hadoop
Common provides a distributed file system (HDFS) that stores data on the compute nodes, providing very
high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node
failures are automatically handled by the framework.
MapReduce
Hadoop supports the MapReduce parallel processing model, which was introduced by Google as a
method of solving a class of petabyte-scale problems with large clusters of inexpensive machines.
MapReduce is a programming paradigm that expresses a large distributed computation as a sequence
of distributed operations on data sets of key/value pairs. The Hadoop MapReduce framework
harnesses a cluster of machines and executes user defined MapReduce jobs across the nodes in the
cluster. A MapReduce computation has two phases, a map phase and a reduce phase (see Figure A-1
below).
Map
In the map phase, the framework splits the input data set into a large number of fragments and
assigns each fragment to a map task. The framework also distributes the many map tasks across
the cluster of nodes on which it operates. Each map task consumes key/value pairs from its
assigned fragment and produces a set of intermediate key/value pairs. For each input key/value
pair (K,V), the map task invokes a user defined map function that transmutes the input into a
different key/value pair (K',V').
Following the map phase the framework sorts the intermediate data set by key and produces a set
of (K',V'*) tuples so that all the values associated with a particular key appear together. It also
partitions the set of tuples into a number of fragments equal to the number of reduce tasks.
Reduce
In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For
each such tuple it invokes a user-defined reduce function that transmutes the tuple into an output
key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the
cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each
reduce task.
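To make the two phases concrete, here is a small, made-up trace of a counting job over two input fragments:

    Map inputs:          (0, "a b a")          (0, "b c")
    Map outputs:         (a,1) (b,1) (a,1)     (b,1) (c,1)
    After shuffle/sort:  (a,[1,1])   (b,[1,1])   (c,[1])
    Reduce outputs:      (a,2)       (b,2)       (c,1)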
Tasks in each phase are executed in a fault-tolerant manner. If node(s) fail in the middle of a
computation the tasks assigned to them are re-distributed among the remaining nodes. Having many
map and reduce tasks enables efficient load balancing and allows failed tasks to be re-run with small
runtime overhead.
The Hadoop MapReduce framework has a master/slave architecture comprising a single master server
or JobTracker and several slave servers or TaskTrackers, one per node in the cluster. The master
node manages the execution of jobs, which involves assigning small chunks of a large problem to many
nodes. The master also monitors node failures and substitutes other nodes as needed to pick up
dropped tasks. The JobTracker is the point of interaction between users and the framework. Users
submit MapReduce jobs to the JobTracker, which puts them in a queue of pending jobs and executes
them on a first-come, first-served basis. The JobTracker manages the assignment of map and reduce
tasks to the TaskTrackers. The TaskTrackers execute tasks upon instruction from the JobTracker and
also handle data motion between the Map and Reduce phases.
Figure A-1. MapReduce Model

[Diagram: the records of the input data set are split into fragments; in the map phase, one map task per split emits intermediate key/value pairs; in the intermediate phase, the pairs are shuffled and sorted so that all values for a given key are grouped together; in the reduce phase, reduce tasks consume the grouped tuples and write the records of the output data set.]
Hadoop Distributed File System (HDFS)
Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across clustered
machines. It is inspired by the Google File System (GFS). HDFS sits on top of the native operating
system’s file system and stores each file as a sequence of blocks. All blocks in a file except the last
block are the same size. Blocks belonging to a file are replicated across machines for fault tolerance.
The block size and replication factor are configurable per file. Files in HDFS are "write once, read
many" and have strictly one writer at any time.
Like Hadoop MapReduce, HDFS follows a master/slave architecture, made up of a robust master node
and multiple data nodes (see Figure A-2 below). An HDFS installation consists of a single NameNode,
a master server that manages the file system namespace and regulates access to files by clients. In
addition, there are a number of DataNodes, one per node in the cluster, which manage storage
attached to the nodes that they run on. The NameNode makes file system namespace operations like
opening, closing, and renaming of files and directories available via an RPC interface. It also
determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and
write requests from file system clients. They also perform block creation, deletion, and replication upon
instruction from the NameNode.
Figure A-2. HDFS Model
[Diagram: a client connects through a 1 Gbit switch to two racks, each behind a 100 Mbit switch. One rack hosts the JobTracker and the NameNode; both racks contain multiple machines, each running a TaskTracker and a DataNode.]
8. Query Optimization
Our recommendation is to do a deep dive on the worst-performing queries, focusing on the ones that
run frequently.
In addition, moving most of the analytics from the MySQL production database to Hadoop will reduce
the data volume and the load on the MySQL database, which will necessarily yield a performance
improvement.
9. Access and Data Security
During our discussions, it was mentioned that some effort would be needed to better protect and
encrypt the URLs used to access the different website pages.
In addition, we have suggested, for future use, securing the data itself with encryption.
10. Internal Management and Collaboration tools
Salesforce appears to be the recommended choice given its numerous management and
collaboration features. It includes all the capabilities required: contact management, project
management and time tracking, technical support management, etc.
Salesforce Professional is $65/user/month, i.e., $3,900 (2,846 €) per year for 5 users.
11. Salesforce and Force.com integration
In addition, Salesforce offers a complete API named Force.com that allows features to be integrated
into your existing platform.
In the future, this API will provide an easy way to integrate new features into the ... application, such as
mobile device support, interfacing with existing applications via AppExchange, and real-time analytics ...
12. Roadmap
Hadoop installation and configuration takes no more than 2 days for one person (see the "Building and
Installing Hadoop-MapReduce" PDF file).
We recommend taking the design phase seriously, in order to build strong foundations for your future
architecture.
Your customer data mart should take no more than a month for a full implementation.
Regarding your internal data mart, the implementation time will depend on how deep you want to go
into analytics; however, with the experience gained from implementing the customer data mart, it
shouldn't take longer than a month.
Of course, we’ll be able to assist you as needed to follow up on your future architecture implementation.
Cloudera also provides various services for Hadoop:
Professional Services (http://www.cloudera.com/hadoop-services)
Best practices for setting up and configuring a cluster suitable to run Cloudera’s Distribution for
Hadoop:
Choice of hardware, operating system, and related systems software
Configuration of storage in the cluster, including ways to integrate with existing storage repositories
Balancing compute power with storage capacity on nodes in the cluster
A comprehensive design review of your current system and your plans for Hadoop:
Discovery and analysis sessions aimed at identifying the various data types and sources streaming
into your cluster
Design recommendations for a data-processing pipeline that addresses your business needs
Operational guidance for a cluster running Hadoop, including:
Best practices for loading data into the cluster and for ensuring locality of data to compute nodes
Identifying, diagnosing, and fixing errors in Hadoop and the site-specific analyses our customers run
Tools and techniques for monitoring an active Hadoop cluster
Advice on the integration of MapReduce job submission into an existing data-processing pipeline,
so Hadoop can read data from, and write data to, the analytic tools and databases our customers
already use
Guidance on the use of additional analytic or developmental tools, such as Hive and Pig, that offer
high-level interfaces for data evaluation and visualization
Hands-on help in developing Hadoop applications that deliver the data-processing and analysis you
need.
How to connect Hadoop to your existing IT infrastructure. We can help with moving data between
Hadoop and data warehouses, collecting data from file systems, creating document repositories,
logging infrastructure and other sources, and setting up existing visualization and analytic tools to work
with Hadoop.
Performance audits of your Hadoop cluster, with tuning recommendations for speed, throughput, and
response times
Training (http://www.cloudera.com/hadoop-training)
Cloudera offers numerous on-line training resources and live public sessions:
Developer Training and Certification
Cloudera offers a three-day training program targeted toward developers who want to learn how
to use Hadoop to build powerful data processing applications.
Over three days, this course will assume only a casual understanding of Hadoop and teach you
everything you need to know to take advantage of some of the most powerful features. We’ll get
into deep details about Hadoop itself, but also devote ample time for hands-on exercises,
importing data from existing sources, working with Hive and Pig, debugging MapReduce and
much more. A full agenda is on the registration page. This course includes the certification exam
to become Cloudera Certified Hadoop Developer.
Sysadmin Training and Certification
Systems administrators need to know how Hadoop operates in order to deploy and manage
clusters for their organizations. Cloudera offers a two-day intensive course on Hadoop for
operations staff. The course describes Hadoop’s architecture, covers the management and
monitoring tools most commonly used to oversee it, and provides valuable advice on setting up,
maintaining and troubleshooting Hadoop for development and production systems. This course
includes the certification exam to become Cloudera Certified Hadoop Administrator.
HBase Training
Use HBase as a distributed data store to achieve low-latency queries and highly scalable
throughput. HBase training covers the HBase architecture, data model, and Java API as well as
some advanced topics and best practices. This training is for developers (Java experience is
recommended) who already have a basic understanding of Hadoop.