Database Architecture Proposal
Scalable database and IT architecture using open source/cloud computing technologies such as Hadoop, Talend, and Force.com.


<Client – Confidential>

Architecture Proposal

Prepared by Bernard Dedieu
Table of Contents

1. Background
2. Problem Statement
3. Proposed Architecture—High-Level
   a. 100GB-scale Data Volume
   b. Log Files as Data Source
   c. Customer-facing OLAP
4. Proposed Architecture—Low-Level
   a. Hadoop
   b. Data Marts
      i. One vs. Many
      ii. Brand of RDBMS
   c. Reporting Portal
   d. Hardware
   e. Java Programming
5. Data Anomaly Detection
6. Data Integration/Importation and Data Quality Management
7. Summary
Appendix A. Hadoop Overview
   MapReduce
   Map
   Reduce
   Hadoop Distributed File System (HDFS)
8. Query Optimization
9. Access and Data Security
10. Internal Management and Collaboration Tools
11. Sales Force and Force.com Integration
12. Roadmap
1. Background

<Company presentation and background – Confidential>

2. Problem Statement

In terms of database load, the number of sites is the best metric, since it describes the number …. It is therefore very important that the web application remains effective as the company grows (this includes the database, the framework, and the server architecture). Also, as the company grows in …, it will need to deploy a server in Europe to manage ... In addition, historical data will be retained, and as the number of ... grows, the data volume will grow exponentially. The overall database architecture therefore needs to be highly and easily scalable.

It is also more than likely that, as the solution price decreases, bigger corporations will be interested in ... solution. Therefore, ... solution will need to be integrated into existing information systems. This will require:

• Interfacing ... solution with existing applications.
• Having ... solution rely on standard and open technologies.
• Building partnerships with system integrators, or building an internal Professional Services organization to support these customers.

With its current, somewhat limited database schema, the data warehouse's millions of records consume more than 2 GB of disk space, including indexes. Extensions to the data warehouse schema, coupled with a growing customer base, will easily push the data warehouse volume beyond 100 GB. The single-instance, multi-schema MySQL database architecture simply does not provide the scalability necessary to meet ... demands.

In addition to these scalability problems, the reporting infrastructure is also limited in its potential for enhanced functionality. For instance, ... would like to extend the Reporting Portal to provide customers with ad-hoc, multi-dimensional query capability and custom reporting based on searchable attribute tags in the data warehouse. At present, the data warehouse dimensions do not provide the flexibility needed to easily accommodate these kinds of changes.

Therefore, ... has a pressing need to replace its current reporting infrastructure with a scalable, flexible architecture that can not only accommodate its growing data volumes, but also dramatically extend its reporting functionality. Key goals for the new infrastructure include:

• Redundant, efficient retention of historical detail
  o Write once, read many
  o Compression
  o No encryption required
  o ANSI-7 single-byte code page is sufficient
• Linear scalability (i.e., as data volume increases, performance is not degraded)
• Flexible extensibility (e.g., attributes can easily be added and exposed to customers for reporting, either as dimensional attributes or fact attributes)
• Full OLAP support
  o Standard reports
  o Custom reports
  o Ad-hoc query
  o Multi-dimensional
  o Hierarchical categories (i.e., tagging, snowflakes)
  o Charts and graphs
  o Drill-down to atomic detail (i.e., ... log)
  o 24x7 availability
  o Query response time measured in seconds (not minutes)
• Efficient ETL
  o Near real time (i.e., < 15 minutes)
  o Handles fluctuating volumes throughout the day without becoming a bottleneck (which can cause synchronization problems in the data warehouse)
• Partitioning of data by customer

This new architecture must deliver vastly improved functionality while controlling implementation cost and time to roll-out.

3. Proposed Architecture—High-Level

From an architectural perspective, there are three overarching factors driving the technical solution for ... reporting needs:

a. 100GB-scale Data Volume

Due to their sheer size, large applications like ...'s data warehouse require more resources than can typically be served by a single, cost-effective machine. Even if a large, expensive server could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine could provide the continuous, uninterrupted operation needed to meet ... SLAs. A cloud computing architecture, on the other hand, is an economical, scalable solution that provides seamless fault tolerance for large data applications.

b. Log Files as Data Source

More and more organizations are seeking to leverage the rich content in their verbose log files to drive business intelligence. Sourcing from log files presents a different set of challenges compared to selecting data out of a highly structured OLTP database. Efficient, robust, and flexible parsing routines must be programmed to identify tagged attributes and map them to business constructs in the data warehouse. And because log files tend to consume a lot of disk space, they should ideally be stored in a distributed file system in order to load-balance I/O and improve fault tolerance.

c. Customer-facing OLAP

The stakes are usually higher when building and maintaining a customer-facing business intelligence solution, as opposed to one that is implemented internally. ... reputation and marketability depend in part on its customers' opinions of the Reporting Portal. It must be intuitive, easy to use, powerful, secure, and available anytime. Its data should be as fresh as possible, while still providing historical data for trend analyses. Customers should have seamless access to both aggregated metrics and ... log detail. The Reporting Portal should expose the customizability of the speech application through its reports. Any customer-specific categories, tags, and data content should be faithfully reflected in the Reporting Portal, just as the customer would expect to see them.
Based on these driving factors, we propose a cloud computing architecture comprising a distributed file system, distributed file processing, one or more relational data marts, and a browser-based OLAP package (see Figure 1). Most of this infrastructure will be built using open source software technologies running on commodity hardware. This strategy keeps initial implementation costs low for a right-sized solution, while providing a path for scalable growth.

Figure 1. High-Level Architecture
[... logs → Hadoop Distributed File System (HDFS) → Relational Data Mart(s) → Reporting Portal]
• ... logs are retained forever (or as otherwise specified per customer requirements).
• ... logs are immediately replicated into HDFS and can be retained indefinitely.
• Any portion of historical data can be read from Hadoop and aggregated as needed into optimized reporting database(s).
• Reports, ad-hoc queries, graphs, and charts are presented via browser-based software.

In this design, Apache Hadoop (http://hadoop.apache.org/) is used to perform some of the functions normally provided by a relational data warehouse. Most specifically, Hadoop behaves as the system of record, storing all of the historical detail generated by the Speech Applications. New ... logs are immediately replicated into the Hadoop Distributed File System (HDFS), which is massively scalable to accommodate virtually any amount of data. HDFS is based on Google's GFS, which essentially stores the content of the Web in order to facilitate index generation. Other well-known companies that store huge volumes of data in HDFS include Yahoo!, AOL, Facebook, and Amazon. Hadoop is free to download and install. It uses a cloud computing architecture (i.e., lots of inexpensive computers linked together, sharing workload), so it can be easily and economically extended as needed to scale for growth. Scaling is linear: performance does not degrade as data volume increases.

Hadoop cannot fulfill all of the functions of a data warehouse, though. For instance, it does not contain indexes like a relational database, so it can't truly be optimized to return query results quickly. Hadoop does provide a very powerful, distributed job processing technology called MapReduce, which can perform much of the extract and transform work that is commonly done by ETL tools. Therefore, Hadoop powerfully augments ... business intelligence architecture by using distributed storage and processing to perform the data warehousing functions that would otherwise be the hardest to scale under a traditional, single-machine, relational data warehouse architecture.
While Hadoop does the "heavy lifting," other, more traditional technologies are used to provide familiar business intelligence functionality. Relational data marts serve up optimized OLAP database schemas (e.g., highly indexed star schemas) for querying via standard business intelligence tools. One defining factor of a data mart is that it can be completely truncated and reloaded from the upstream data repository (in this case, Hadoop) as needed. This means that if ... needs to enhance the reporting database design by altering a dimension or adding new metrics, the data mart's schema can be altered—even dramatically—and repopulated without the risk of losing any historical data. It's also worth noting that because the Hadoop repository stores all historical detail, it is possible to retroactively back-populate new metrics that are added to the data mart(s).

As of this writing, it is not known how much data volume must be accommodated in a given data mart. And we don't yet know whether one data mart would suffice, or whether there would be many data marts. These questions will influence the choice of relational database management system (RDBMS) selected for .... For example, MySQL is cheap to procure and implement, but has serious scalability limitations. A columnar MPP database like ParAccel is ideal for handling multi-terabyte data volumes, but comes with a price tag. One advantage of this proposed architecture, though, is that the data marts can be migrated from one technology to another without risk of losing valuable data.

The customer-facing front-end technology should be a mature, fully supported product like BusinessObjects or MicroStrategy. Such technologies are rich with features that would otherwise be very costly to develop in-house, even with open source Java libraries. Besides, the customers who use this interface should not become quality assurance testers for internally developed user interfaces. The Reporting Portal is a marketed service and, as such, must leave customers with a great impression.

4. Proposed Architecture—Low-Level

This section provides an in-depth look at each component in Figure 1 above.

a. Hadoop

Hadoop is an extremely powerful open source technology that does certain things very well, like store immense volumes of data and perform distributed computations on that data. Some of these strengths can be leveraged within the context of a business intelligence application. For instance, several of the functions that would normally be performed within a traditional data warehouse could be taken up by Hadoop.

One defining feature of a data warehouse is that it stores historical data. While source systems may only keep a rolling window of recent data, the data warehouse retains all or most of the history. This frees up the transactional systems to efficiently run the business, while keeping a historical system of record in the data warehouse. HDFS is ideal for archiving large volumes of static data, such as ... ... logs. HDFS provides linear scalability as data volumes increase. Not only can HDFS easily handle ... forever retention requirement, but it could also permit ... to retain all of its history. HDFS comfortably scales into the petabyte range, so the need to age out and purge files could be eliminated altogether. Hadoop is a perfect solution for historicity problems, because it easily scales to petabyte sizes simply by configuring additional hardware into the cluster.

Another benefit of HDFS is its data redundancy. HDFS replicates file blocks across nodes, which can physically reside in the same data center or in another data center (assuming the VPN bandwidth supports it). This would entirely eliminate the need for ... to copy zipped ... log files between data centers (see Figure 2).
Figure 2. ... Log-Hadoop Architecture
[VXML … logs → a Java program reads each ... log and writes it into HDFS for permanent storage → HDFS (Figure A-2) / MapReduce (Figure A-1)]
The Hadoop Distributed File System (HDFS) can be configured to transparently replicate data across racks and across data centers, providing redundant failover copies of all file blocks.

Although business intelligence solutions depend on lots of data, business users are interested in information. In order to transform large volumes of raw data into meaningful business metrics, calculations must be performed, business rules must be applied, and large numbers of data elements must be summarized into a few figures.

Traditionally, this type of aggregation work is done outside of the data warehouse by an extract, transform, and load (ETL) tool, or within the data warehouse using stored procedures and materialized views. Due to the inherent constraints imposed by a relational database system like MySQL, there are limits to how much data can reasonably be aggregated this way. As source data volumes increase, the time required to perform aggregations can extend beyond the point in time when the resulting metrics are needed by the customers.

Hadoop is able to perform these kinds of aggregations much more quickly on large data volumes because it distributes the processing across many computers, each one crunching the numbers for a subset of the source data. Consequently, aggregated metrics that might have taken days to calculate in a traditional data warehouse model can be churned out by Hadoop in a couple of hours or even minutes.

MapReduce is particularly well-suited to structured data sets like ... ... logs. Tagged attributes map easily to key/value pairs, which are the transactional unit of MapReduce jobs (see Figure A-1 in the appendix). ... ETL routines could therefore be replaced with Java MapReduce jobs that read from HDFS ... log files and write to the data marts (see Figure 3).
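To make the key/value mapping concrete, here is a minimal sketch of what such a Java MapReduce mapper could look like. The log layout (semicolon-separated name=value attributes) and the class name LogAttributeMapper are illustrative assumptions, not part of the proposed implementation.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: parses the tagged attributes out of one log line and
// emits them as (attribute name, attribute value) pairs for the reduce phase.
public class LogAttributeMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed log format: semicolon-separated "name=value" pairs.
        for (String field : line.toString().split(";")) {
            String[] pair = field.split("=", 2);
            if (pair.length == 2) {
                context.write(new Text(pair[0].trim()), new Text(pair[1].trim()));
            }
        }
    }
}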
Figure 3. Hadoop MapReduce Architecture
[HDFS (Figure A-2) → MapReduce (Figure A-1) → JDBC → Relational Data Mart(s) → other BI tools]
Java programs execute MapReduce jobs to extract and transform any subset of ... log data, and then write the aggregated results into the relational data marts via JDBC. The entire history of ... logs is permanently stored in Hadoop, making it possible to back-populate new metrics with old data, perform year-over-year trend reports, and manually mine data as needed.

There are also quite a few maturing open source tools that can provide analysts direct access to Hadoop data. For instance, a tool like HBase or Hive can be used as a SQL-like interface into Hadoop, permitting analysts to run queries in much the same way that they would access a traditional data warehouse. These tools might be useful to ... personnel who want to perform analyses that are not immediately available through the Reporting Portal. Such tools are best suited for more technically literate analysts who are comfortable writing their own queries and do not require fast query response times.

Cloudera (http://www.cloudera.com/) recently unveiled its browser-based Cloudera Desktop product. This tool simplifies some of the work required to set up, execute, and monitor MapReduce jobs. For the more technically inclined analysts in ... organization, Cloudera Desktop might be a good fit—even better than one of the SQL emulators like HBase. Cloudera Desktop's main features include:

• File Browser – navigate the Hadoop file system
• Job Browser – examine MapReduce job states
• Job Designer – create MapReduce job designs
• Cluster Health – at-a-glance state of the Hadoop cluster

It is also possible to use Hadoop's MapReduce to generate "canned reports" in batch processing mode. That is, nightly batch jobs can be scheduled to produce static reports. These reports would consume data directly from Hadoop, and the resulting content could be pre-formatted for presentation via HTML. Such reports would effectively bypass the relational data mart altogether.
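As a rough illustration of the JDBC hand-off shown in Figure 3, the sketch below writes a batch of aggregated results into a relational data mart. The connection URL, credentials, and the fact_metric table are assumptions made for the example only; the real schema would come out of the data mart design.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

public class DataMartLoader {

    // Inserts aggregated (metric name, metric value) results into the data mart.
    public void load(Map<String, Long> aggregates) throws Exception {
        String url = "jdbc:mysql://datamart-host:3306/reporting"; // assumed host and schema
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO fact_metric (metric_name, metric_value) VALUES (?, ?)")) {
            conn.setAutoCommit(false);
            for (Map.Entry<String, Long> row : aggregates.entrySet()) {
                insert.setString(1, row.getKey());
                insert.setLong(2, row.getValue());
                insert.addBatch();
            }
            insert.executeBatch(); // one round trip for the whole batch
            conn.commit();
        }
    }
}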
b. Data Marts

Stated simply, Hadoop can make an excellent contribution as a component of a business intelligence solution, but it cannot be the whole solution. A key limitation is that a data warehouse is indexed to provide fast query response times, while Hadoop data is not. A data warehouse (or data mart) typically contains pre-aggregated metrics in order to deliver selected results as fast as possible (i.e., without re-aggregating on the fly). Therefore, a gating factor in deciding whether to run analytic queries and reports against Hadoop is the end user's expectation for response time. Since ... customers expect and deserve immediate to near-immediate query performance, directly querying Hadoop is not a viable design for the Reporting Portal. It is also worth noting here that most of the mature, industry-standard OLAP tools like BusinessObjects and MicroStrategy cannot be coupled directly with Hadoop.

Therefore, the ... reporting infrastructure will still require a traditional, relational, indexed data store containing pre-aggregated metrics. This data store is rightly called a data mart, because it is not the historical repository of detailed data, or system of record. All of its content can be regenerated at any time from the upstream data source.

... has two basic architectural decisions to make with regard to the data mart. The first is whether to create one data mart or multiple data marts. The second is which brand of RDBMS to implement.

i. One vs. Many

There are a couple of compelling reasons to implement multiple, separate data marts. One reason is performance. The less data you cram into a relational database, the faster it generally performs. There can be exceptions to this rule (like ParAccel's Analytic Database), but relational databases are usually more responsive with smaller data volumes.

A second motivation for splitting ... data into multiple marts is security. It is certainly quite possible to implement robust security within a single relational database instance, but physically separating each customer's data definitely ensures that customers cannot see one another's content. However, it is strongly recommended that ... not rely solely on physical separation to enforce data security. There might be situations in which it is not economical to store lots of small customers' data separately. ... should retain the option to co-mingle multiple customers' data in one database instance, while ensuring privacy for each of them.
Figure 4. Multiple Data Marts
[System of record (contains all historical detail) → separate relational data marts for Customer A, Customer B, Customer C, …]

A third reason for implementing multiple data marts is customizability. It is quite possible that Customer A might require different kinds of metrics from what Customer B needs. One data mart would have to be all things to all customers, making it horribly complex. The turnaround time required to add customer-specific metrics would be greatly improved by hosting them in a dedicated data mart. Having multiple data marts would be very similar to ... current reporting architecture, which uses dedicated MySQL schemas to partition customer data.

ii. Brand of RDBMS

There are several factors influencing ... choice of relational database management system. The primary factor will likely be data volume, which is itself influenced by many factors (e.g., data model, historical timeframe, individual customers' ... log volume). Therefore, within the context of this proposal, it is not possible to accurately estimate data sizing. Instead, we can provide some basic guidance for future reference. From our experience, relatively small volumes (i.e., 10s of GB or less) can be comfortably accommodated by MySQL. Medium volumes (up to 100s of GB) are better served by Microsoft SQL Server or Oracle. Large volumes (100s of GB to TB scale) require a columnar MPP database like ParAccel Analytic Database, Netezza, Teradata, Exadata, or Vertica.

In addition to data volume, ... will likely consider cost. MySQL is free, while other products can cost hundreds of thousands of dollars to purchase. The cost of a given RDBMS may also depend in part on the hardware needed to support it. Some RDBMS products only run on certain brands of hardware. Clearly, this can have far-reaching ramifications for ... costs of operations. We recommend that ... choose database software that can run on any Intel-powered, rackable server. Such hardware will provide the most economical scalability path.
Table 1. RDBMS Recommendations

  Data Volume         Brand                        Notes
  Up to 10s of GB     MySQL                        Free, but doesn't scale well
  Up to 100s of GB    Microsoft SQL Server         Good value for money, easy to run on commodity hardware
  100s of GB to TB    ParAccel Analytic Database   Powerful, hardware-flexible, negotiable pricing model

c. Reporting Portal

... next-generation Reporting Portal could provide its customers with a greatly expanded set of features if it is replaced with an industry-standard business intelligence tool like BusinessObjects or MicroStrategy. The choice of such a tool will essentially be driven by how ... customers' needs change and, more importantly, by whether ... starts to win bigger corporations with existing IT architectures as clients. In the short to medium term, an open source tool such as DataVision (http://datavision.sourceforge.net) would be a perfect solution, making it easy to produce custom reports and to generate the results in XML format. The XML format makes report distribution almost operating-system agnostic; the only requirement is that the platform on which the reports need to be viewed can read XML files.

These web-based tools leverage the power of metadata to enforce security and map business metrics to back-end data structures. A metadata-based tool flexibly supports business abstractions like categories and hierarchies that are not inherent to the physical data. Business intelligence tools offer a rich presentation layer capable of displaying the graphs, charts, and pivot tables that business users have come to expect from reporting interfaces.

Figure 5. Browser-based Front End
[Relational data marts → BI web server (backed by a BI metadata repository) on ... network → Internet → customer's browser]
A vendor-supported business intelligence application provides a richly featured, web-based interface. Customers can run standard and custom reports, issue ad-hoc queries, generate charts and graphs, save results to Excel, etc.
By leveraging a mature front-end technology, ... gains the advantage of reducing its internal Java development effort, while giving its customers a greatly expanded set of reporting and OLAP functionality. There are many products on the market, some cheaper and less mature than the long-standing industry leaders, BusinessObjects XI 3.1 and MicroStrategy 9. Our recommendation to ... is to be willing to invest in this customer-facing component so that it makes the most appealing impression on its end users.

d. Hardware

All of the technologies outlined thus far will run quite well on the type of hardware that ... currently uses to serve the Reporting Portal's data warehouse. ... could purchase several more of the rackable Dell PowerEdge 2950 server trays running Windows Server 2003 and array them as a Hadoop cluster, data mart hosts, or web servers. Operational considerations like data center space and power notwithstanding, this hardware choice would preserve ... current SOE (standard operating environment) and minimize retraining of operations staff.

e. Java Programming

One reason the Hadoop technology was selected is the high degree of skill and experience that ... personnel have with Java programming. As discussed earlier, interfaces into and out of Hadoop will most likely be coded in Java. These interfaces would likely be designed, developed, tested, and supported by ... personnel. At first blush, this statement might raise concerns about the cost of hand-coding data interfaces versus buying a vendor-supported product. However, there are currently no data integration products on the market that perform these tasks. Furthermore, even if an off-the-shelf data integration (ETL) tool like Informatica PowerCenter were purchased, it would still require expensive consulting services to implement and support. Net net, programming these interfaces in Java is actually a very logical choice for ....

5. Data Anomaly Detection

In addition, thanks to its extensive analytics capabilities and performance, Hadoop makes it possible to run different kinds of deep analysis to define data anomaly patterns, and then to detect and report them within minutes. You will find attached several documents describing different anomaly detection approaches. There is also a lot of information available on the Hadoop wiki, such as http://wiki.apache.org/hadoop/Anomaly_Detection_Framework_with_Chukwa, which describes the Chukwa framework for detecting anomalies.

6. Data Integration/Importation and Data Quality Management

As an alternative to using Hadoop's ETL features, Cloudera (the open source Hadoop distributor) and Talend (an open source ETL tool – Extract, Transform, and Load) recently announced a technology partnership: http://www.cloudera.com/company/press-center/releases/talend_and_cloudera_announce_technology_partnership_to_simplify_processing_of_large_scale_data. Talend is the recognized market leader in open source data management. Talend's solutions and services minimize the costs and maximize the value of data integration, ETL, data quality, and master data management. We highly recommend using Talend as the dedicated tool for data integration, ETL, and data quality.
7. Summary

Based on key factors like terabyte-scale data volumes, log files as data source, and customer-facing OLAP, the optimal architecture for ... Reporting Portal infrastructure comprises a cloud computing model with distributed file storage; distributed processing; optimized, relational data marts; and an industry-leading, web-based, metadata-driven business intelligence package. The cloud computing architecture affords ... virtually unlimited, linear scalability that can grow economically with demand. Relational data marts ensure excellent query performance and low-risk flexibility for adding metrics, changing reporting hierarchies, etc.
Appendix A. Hadoop Overview

Due to their sheer size, large applications like ...'s data warehouse require more resources than can typically be served by a single, cost-effective machine. Even if a large, expensive server could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine could provide the continuous, uninterrupted operation needed by today's full-time applications. The Hadoop open source framework—or Hadoop Common, as it is now officially known—is a Java cloud computing architecture designed as an economical, scalable solution that provides seamless fault tolerance for large data applications.

Hadoop is a top-level Apache Software Foundation project, built and used by a community of contributors from all over the world. As such, Hadoop is not a vendor-supported software package. It is a development framework that requires in-depth programming skills to implement and maintain. Therefore, an organization that chooses to deploy Hadoop will need to employ skilled personnel to maintain the cluster, program MapReduce jobs, and develop input/output interfaces.

Hadoop Common runs applications on large, high-availability clusters of commodity hardware. It implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. In addition, Hadoop Common provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node failures are automatically handled by the framework.

MapReduce

Hadoop supports the MapReduce parallel processing model, which was introduced by Google as a method of solving a class of petabyte-scale problems with large clusters of inexpensive machines. MapReduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase (see Figure A-1 below).

Map

In the map phase, the framework splits the input data set into a large number of fragments and assigns each fragment to a map task. The framework also distributes the many map tasks across the cluster of nodes on which it operates. Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs. For each input key/value pair (K,V), the map task invokes a user-defined map function that transmutes the input into a different key/value pair (K',V'). Following the map phase, the framework sorts the intermediate data set by key and produces a set of (K',V'*) tuples so that all the values associated with a particular key appear together. It also partitions the set of tuples into a number of fragments equal to the number of reduce tasks.

Reduce

In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For each such tuple it invokes a user-defined reduce function that transmutes the tuple into an output key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the cluster of nodes and handles shipping the appropriate fragment of intermediate data to each reduce task.
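As a concrete illustration of the two phases just described, the following self-contained sketch counts how many times each key occurs: the map phase emits (key, 1) pairs, the framework groups them by key, and the reduce phase sums the grouped values. It is written against the standard Hadoop MapReduce Java API; the job name and input/output paths are placeholder assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyCountJob {

    public static class KeyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Map phase: transmute each input record into an intermediate (K, 1) pair.
            ctx.write(new Text(line.toString().trim()), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            // Reduce phase: all values for one key arrive together; sum them.
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            ctx.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "key-count");
        job.setJarByClass(KeyCountJob.class);
        job.setMapperClass(KeyMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/logs/input"));        // assumed input path
        FileOutputFormat.setOutputPath(job, new Path("/logs/key-counts")); // assumed output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}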
Tasks in each phase are executed in a fault-tolerant manner. If nodes fail in the middle of a computation, the tasks assigned to them are re-distributed among the remaining nodes. Having many map and reduce tasks enables efficient load balancing and allows failed tasks to be re-run with small runtime overhead.

The Hadoop MapReduce framework has a master/slave architecture comprising a single master server, or JobTracker, and several slave servers, or TaskTrackers, one per node in the cluster. The master node manages the execution of jobs, which involves assigning small chunks of a large problem to many nodes. The master also monitors node failures and substitutes other nodes as needed to pick up dropped tasks. The JobTracker is the point of interaction between users and the framework. Users submit MapReduce jobs to the JobTracker, which puts them in a queue of pending jobs and executes them on a first-come, first-served basis. The JobTracker manages the assignment of map and reduce tasks to the TaskTrackers. The TaskTrackers execute tasks upon instruction from the JobTracker and also handle data motion between the map and reduce phases.
Figure A-1. MapReduce Model
[Input data set → splits → map phase (map tasks emit intermediate key/value pairs) → shuffle and sort → intermediate (key, value*) tuples → reduce phase (reduce tasks) → output data set]

Hadoop Distributed File System (HDFS)

Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across clustered machines. It is inspired by the Google File System (GFS). HDFS sits on top of the native operating system's file system and stores each file as a sequence of blocks. All blocks in a file except the last block are the same size. Blocks belonging to a file are replicated across machines for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once, read many" and have strictly one writer at any time.
Like Hadoop MapReduce, HDFS follows a master/slave architecture, made up of a robust master node and multiple data nodes (see Figure A-2 below). An HDFS installation consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, one per node in the cluster, which manage storage attached to the nodes that they run on. The NameNode makes file system namespace operations like opening, closing, and renaming of files and directories available via an RPC interface. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from file system clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.

Figure A-2. HDFS Model
[A client connects through a 1 Gbit switch to two racks, each behind a 100 Mbit switch; the cluster comprises a JobTracker, a NameNode, and multiple TaskTracker/DataNode machines spread across the racks.]
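For reference, the ingestion step pictured in Figure 2 (a Java program writing each log into HDFS for permanent storage) could look roughly like the sketch below, which uses the Hadoop FileSystem API. The NameNode address and file paths are placeholder assumptions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogArchiver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the cluster's NameNode (assumed host and port).
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path local = new Path("/var/logs/app-2009-11-01.log");     // assumed local source file
        Path remote = new Path("/archive/logs/app-2009-11-01.log"); // assumed HDFS target path

        // copyFromLocalFile writes the file into HDFS; the framework then
        // replicates its blocks across DataNodes per the configured factor.
        hdfs.copyFromLocalFile(local, remote);
        hdfs.close();
    }
}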
8. Query Optimization

Our recommendation is to do a deep dive on the worst-performing queries, focusing on the ones that run frequently. In addition, moving most of the analytics from the MySQL production database to Hadoop will reduce the data volume and the load on the MySQL database, which will in itself improve performance.

9. Access and Data Security

During our discussions it was mentioned that some effort would be needed to better protect and encrypt the URLs used to access the different website pages. In addition, we have suggested, for future use, securing the data themselves through encryption.

10. Internal Management and Collaboration Tools

Sales Force appears to be the recommended choice given its numerous management and collaboration features. It includes all the capabilities required: contact management; project management and time tracking; technical support management; …

Sales Force Professional is $65/user/month, i.e., $3,900 (2,846 €) per year for 5 users.
11. Sales Force and Force.com Integration

In addition, Sales Force offers a complete API platform named Force.com that allows its features to be integrated into your existing platform. In the future, this API will provide an easy way to integrate new features into ... application, such as mobile device support, interfacing with existing applications using AppExchange, real-time analytics, …
12. Roadmap

Hadoop installation and configuration takes no more than 2 days for one person (see the "Building and Installing Hadoop-MapReduce" PDF file). We recommend taking the design phase seriously in order to build strong foundations for your future architecture. Your customer data mart should take no more than a month for a full implementation. Regarding your internal data mart, the implementation time will depend on how deep you want to go with analytics; however, with the experience gained implementing the customer data mart, it should not take longer than a month. Of course, we will be able to assist you as needed to follow up on your future architecture implementation.

Cloudera also provides several services around Hadoop:

Professional Services (http://www.cloudera.com/hadoop-services)

Best practices for setting up and configuring a cluster suitable to run Cloudera's Distribution for Hadoop:
• Choice of hardware, operating system, and related systems software
• Configuration of storage in the cluster, including ways to integrate with existing storage repositories
• Balancing compute power with storage capacity on nodes in the cluster

A comprehensive design review of your current system and your plans for Hadoop:
• Discovery and analysis sessions aimed at identifying the various data types and sources streaming into your cluster
• Design recommendations for a data-processing pipeline that addresses your business needs

Operational guidance for a cluster running Hadoop, including:
• Best practices for loading data into the cluster and for ensuring locality of data to compute nodes
• Identifying, diagnosing, and fixing errors in Hadoop and the site-specific analyses our customers run
• Tools and techniques for monitoring an active Hadoop cluster
• Advice on the integration of MapReduce job submission into an existing data-processing pipeline, so Hadoop can read data from, and write data to, the analytic tools and databases our customers already use
• Guidance on the use of additional analytic or development tools, such as Hive and Pig, that offer high-level interfaces for data evaluation and visualization

Hands-on help in developing Hadoop applications that deliver the data processing and analysis you need.

How to connect Hadoop to your existing IT infrastructure: we can help with moving data between Hadoop and data warehouses, collecting data from file systems, document repositories, logging infrastructure, and other sources, and setting up existing visualization and analytic tools to work with Hadoop.

Performance audits of your Hadoop cluster, with tuning recommendations for speed, throughput, and response times.
Training (http://www.cloudera.com/hadoop-training)

Cloudera offers numerous online training resources and live public sessions:

Developer Training and Certification
Cloudera offers a three-day training program targeted toward developers who want to learn how to use Hadoop to build powerful data processing applications. Over three days, this course assumes only a casual understanding of Hadoop and teaches you everything you need to know to take advantage of some of its most powerful features. We'll get into deep details about Hadoop itself, but also devote ample time to hands-on exercises, importing data from existing sources, working with Hive and Pig, debugging MapReduce, and much more. A full agenda is on the registration page. This course includes the certification exam to become a Cloudera Certified Hadoop Developer.

Sysadmin Training and Certification
Systems administrators need to know how Hadoop operates in order to deploy and manage clusters for their organizations. Cloudera offers a two-day intensive course on Hadoop for operations staff. The course describes Hadoop's architecture, covers the management and monitoring tools most commonly used to oversee it, and provides valuable advice on setting up, maintaining, and troubleshooting Hadoop for development and production systems. This course includes the certification exam to become a Cloudera Certified Hadoop Administrator.

HBase Training
Use HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. HBase training covers the HBase architecture, data model, and Java API, as well as some advanced topics and best practices. This training is for developers (Java experience is recommended) who already have a basic understanding of Hadoop.