The IBM Netezza appliance is a database platform optimized for, and integrated with, high-performance hardware. New ways of analyzing large volumes of data are a key competitive factor for companies.
Traditional data warehouse solutions are rigid, complex, slow, and expensive, which inhibits agile decision-making. IBM Netezza was developed to address exactly this problem: a solution with disruptive performance, very simple administration, and reduced cost, used for large and complex data warehouses.
It integrates database, server, and storage disks in a single rack. Its massively parallel processing architecture combines processing blades, disks, and a data-filtering step implemented as logic stored directly in chips (FPGAs: field-programmable gate arrays). This is IBM Netezza's performance differentiator compared with competitors: because its architecture uses FPGAs with logic recorded directly in silicon, it reads less data and moves less data between internal components, eliminating waste and processing bottlenecks. The architecture dispenses with tuning, indexes, partitioning, and so on, which makes administration simple and gives the technical team more time for business projects instead of technical and administrative activities that add no value.
Watch the webcast at http://www.videolog.tv/devworksbr/videos/716598
Data warehouse - Get consistent insights for your business: meet the new IBM Netezza appliance
1. Get consistent insights for your business: Meet the new IBM Netezza appliance. Christiano Hage (chrishg@br.ibm.com), October 2011
2.
3. Would you still use Google if it took 3 days and 5 people to get a result? Information Management
4. "Approximately 70% of data warehouses experience performance limitations." - Gartner 2010 Magic Quadrant. Complex management. Requires specialized resources. Constant tuning. Days for a single query. Information Management
5. Information Management. Data continues to expand exponentially. Business analytics is becoming more complex as the business demands faster answers. The data warehouse is now mission critical.
9. Simplicity and compatibility with market tools. ODBC 3.X, JDBC Type 4, SQL-92, SQL-99 Analytics. IBM Netezza 1000 appliance. Source systems, clients, High Performance Loader, 3rd-party apps, DBA CLI, ETL server. SOLARIS, LINUX, HP-UX, AIX, WINDOWS, TRU64
10. The revolutionary approach of IBM Netezza. The appliance: simpler, faster, and more accessible business analytics. "This is what Netezza did in the data warehousing market: it totally changed the way we think about data warehousing." - Philip Howard, Bloor Research
11. Media; Banking and Financial Services; Government; Health & Life Sciences; Retail / Industry; Telecom; Other
12. The true appliance delivers much faster and easier deployment. "They shipped us an appliance, we put it in our data center and plugged it into our network. Within 24 hours the appliance was running. I am not exaggerating, it was as easy as I am saying." - Joseph Essas, Vice President of Technology, eHarmony
13. The true appliance delivers speed to transform the business. "… when the analyses took 24 hours I was limited, but now that they take seconds, I can rethink my business completely …" - SVP Application Development, Nielsen
14.
15. The true appliance delivers lower total cost of ownership. "Our data warehouse team consists of one or two people who need to make only small changes every three months." - Mark Saponar, CIO, iBasis
16. The IBM Netezza 1000 appliance. SMP hosts: SQL compiler, query plan, optimizer, administration. S-Blades™ (FPGA-based database accelerator): processors, RAM, and FPGAs; high-performance database engine for joins, aggregations, sorts, etc. Disks/storage: user data, mirroring, temp/swap.
18. S-Blade™ components: Intel quad-core 2+ GHz CPU; dual-core FPGA, 125 MHz; 32 GB DRAM; IBM BladeCenter server; Netezza DB Accelerator; SAS expander module
19. Traditional architecture used for DW: request, CPU, general-purpose storage. Data warehouse workload: fewer requests, lots of data movement
20. Traditional architecture used for DW: request, results, CPU, general-purpose storage. Data warehouse workload: the traditional architecture is inefficient
21. IBM Netezza architecture: request, results, CPU, intelligent storage. Asymmetric massively parallel processing. Asymmetric MPP: intelligent data storage
22. IBM Netezza architecture: request, results, CPU, intelligent storage; 1% network traffic, 2% of CPU required. Asymmetric massively parallel processing. Asymmetric MPP: highly efficient data movement
23. Asymmetric Massively Parallel Processing™. IBM Netezza 1000 appliance. SMP host front end: high-speed loader/unloader, execution engine, SQL compiler, query plan, optimize, admin. High-performance database engine (streaming joins, aggregations, sorts): S-Blades, each with processor and streaming DB logic. Massively parallel intelligent storage: disks 1, 2, 3 … 920. Network interfaces: ODBC 3.X, JDBC Type 4, OLE-DB, SQL/92. Source systems, clients, High Performance Loader, 3rd-party apps, DBA CLI, ETL server (SOLARIS, LINUX, HP-UX, AIX, WINDOWS, TRU64)
24. Asymmetric Massively Parallel Processing™. IBM Netezza 1000 appliance. SQL arrives at the SMP host front end (high-speed loader/unloader, SQL compiler, query plan, optimize, admin, execution engine) and is sent as snippets 1, 2, 3 to the high-performance database engine (streaming joins, aggregations, sorts): S-Blades, each with processor and streaming DB logic. Massively parallel intelligent storage: disks 1, 2, 3 … 920. Network. Source systems, clients, High Performance Loader, 3rd-party apps, DBA CLI, ETL server (SOLARIS, LINUX, HP-UX, AIX, WINDOWS, TRU64)
25. Our secret: the FPGA. Query: select CIDADE, PRODUTO, sum(QUANTIDADE) from VENDAS where MES = '20091201' and SEGMENTO = 509123 and CATEGORIA = 'GASTRO' group by CIDADE, PRODUTO order by PRODUTO; Input: part of the VENDAS table (compressed data). FPGA: decompresses; eliminates unused columns (select CIDADE, PRODUTO, sum(QUANTIDADE)); restricts and applies visibility (where MES = '20091201' and SEGMENTO = 509123 and CATEGORIA = 'GASTRO'). CPU: complex operations such as joins and aggregations (sum(QUANTIDADE), group by, order by).
26. Asymmetric Massively Parallel Processing™. Netezza TwinFin appliance. SMP host front end: high-speed loader/unloader, SQL compiler, query plan, optimize, admin, execution engine. High-performance database engine (streaming joins, aggregations, sorts, etc.): S-Blades, each with processor and streaming DB logic. Massively parallel intelligent storage: disks 1, 2, 3 … 920. Consolidation of results 1, 2, 3 from each S-Blade. Network interfaces: ODBC 3.X, JDBC Type 4, OLE-DB, SQL/92. Source systems, clients, High Performance Loader, 3rd-party apps, DBA CLI, ETL server (SOLARIS, LINUX, HP-UX, AIX, WINDOWS, TRU64)
31. Traditional architecture: data, ETL, data warehouse, SQL, analytic grid (SPSS/SAS; C/C++, Java, Python, Fortran, …), fraud detection, risk
32. Challenges: Limited analytic sophistication? Inflexible when changing course? Takes too long? Working only with samples. High cost. Inefficient process? Difficulty running experiments?
33. Architecture with IBM Netezza: in-database analytics. Data, ETL, data warehouse, SQL, analytic grid (SPSS/SAS; C/C++, Java, Python, Fortran, …), fraud detection, risk
34. Architecture with IBM Netezza: in-database analytics. Fraud detection, risk, SPSS/SAS
35. IBM Netezza appliance family. Netezza 100: development and test environment, 1 TB to 10 TB. Netezza 1000: data warehouse, high-performance analytics, 1 TB to 1.5 PB. Netezza H1000: high density, queryable archiving, backup / DR, 100 TB to 10 PB.
36.
37.
38.
39. Create your dW profile: ibm.co/registrodW. Join the community: ibm.co/webcastsdw. IBMdeveloperWorksBrasil, @soudW
Whether it's general information such as the location of a restaurant or movie listings at the nearby theater, or something more esoteric such as the population of Timbuktu (54,453 in 2009), everyone turns to Google for answers to questions big and small. But think about it for a second: would you still use the giant search engine if it took days to get an answer, or an army of specialists to tune the system to get the right results back? Of course not.
However, users in the enterprise who seek insights from their BI and data warehouse infrastructure to make critical, business-affecting decisions are resigned to living with and working around those issues on a daily basis. In fact, according to Gartner, nearly 70% of data warehouses experience performance issues, not to mention the constant tuning and specialist resources required to get any meaningful results from the data in the DW. Despite all the tuning and indexing, a single new report or ad-hoc analytic query can throw the data warehouse out of whack, requiring hours or even days in some cases to process the report, plus new indexes, aggregations and tuning parameters to eke out some meager measure of performance from the system, all while breaking the older reports, queries and aggregations. And thus the cycle continues!
At the same time, data continues to grow exponentially in every industry as more information is digitized. We as consumers and users of the web and mobile devices are generating a larger and larger footprint: not just our likes and dislikes, but our moods, behaviors, connections, interactions and influences are potentially captured as digital data. Interactions between machines (GPS devices, mobile phones, RFID sensors) will dwarf the amount of data generated by humans. Meanwhile, the window of opportunity keeps shrinking, and the sheer number and complexity of decisions that enterprises make keeps going up dramatically: for example, which ads to place on a website, what offers to make to customers, which network to route calls on, and how to detect fraudulent transactions and stop them before completion. The rich sources of data offer far deeper insights than were available before, fuelling the need for more analytics. As more and more companies "compete on analytics", the data warehouse becomes mission critical and businesses simply cannot afford to continue with the status quo!
Netezza took a completely fresh approach to analytics, designing an appliance from the ground up specifically for high-performance DW and analytics: an appliance that has totally transformed the data warehousing landscape. By making analytics on big data dramatically simpler and faster, Netezza has made analytics more accessible within the enterprise. With Netezza, data warehouses are no longer just repositories that data enters and never leaves, mere "holding pens for data". Rather, they are systems that users rely on to derive meaningful, timely insights that drive the business.
As a true appliance, TwinFin is extremely simple to deploy and offers very quick time-to-value. It takes less than 2 days from crate to query: from the time the system arrives in a crate at the customer site, through plugging it into the data center and sanity testing, to having it ready for data loading. The quote from the VP of Technology at eHarmony (the leading trusted online dating site for singles) says it all. http://www.netezza.com/releases/2009/release012809.htm http://www.netezza.com/media/2009/1-25969542_Eprint.pdf
The Nielsen Company's business is all about data. Nielsen gathers information from multiple sources and offers clients a complete understanding of what consumers watch, browse and buy. Their entire analytics infrastructure is based on Netezza, where their end-user clients run close to a million queries a day, 50 times faster than on their previous systems. The scale and performance of Netezza have been quite transformative at Nielsen. I love this quote from their SVP of Application Development: when you're able to get deep insights in 10 seconds instead of 24 hours, you can do remarkable things with the business. http://www.netezza.com/videos/nielsen.aspx
Catalina Marketing maintains the world's largest loyalty database, with 3 years of detailed transaction history on more than 250 million consumers, gathered from 35,000 grocery stores and pharmacies. With Netezza, they can derive targeted insights individualized to every consumer's preferences and purchase habits from this multi-petabyte database. They use these insights to print coupons at the point of sale that are unique to the individual consumer, essentially treating every consumer as a unique "segment of one". Their models are highly specialized and do a much better job of predicting what consumers want compared to anyone else in the industry, yielding coupon redemption rates that are 5-10X better than the retail industry average. This translates directly into business value for Catalina! Note: the InformationWeek article below talks about Catalina using Netezza and SAS to improve the efficiency of the predictive models they create by 10X. The time to score a model has gone down from 4 hours to 60 seconds. http://www.netezza.com/releases/2007/release111507.htm http://www.netezza.com/releases/2010/release062110_3.htm http://www.informationweek.com/news/showArticle.jhtml?articleID=226600216 http://www.youtube.com/watch?v=WPN88Ni73UE
The simplicity of the "true appliance" approach ultimately delivers much lower total cost of ownership. The architecture is pre-tuned and optimized for high-performance analytics, requiring no tuning, indexing, partitioning, aggregations, etc.: it is a plug-and-play data warehouse. At iBasis, one of the largest carriers of international voice traffic in the world, the entire data warehouse team is one or two employees who are needed once every three months to operate a multi-hundred-terabyte data warehouse. It's that simple to administer, delivering very low TCO. http://www.netezza.com/releases/2006/release032706.htm http://www.netezza.com/documents/ibasis_casestudy.pdf http://www.youtube.com/watch?v=CoPGzAczY3k
These are the components of the S-Blade. There are 8 Intel-based CPU cores, 8 powerful FPGA engines and 16 GB of RAM available on each S-Blade. Furthermore, each S-Blade is responsible for processing data streamed from 8 disks. Thus there is a 1:1:1 ratio between the disks, processors and FPGA. Each processor and FPGA core pair has 2 GB of RAM available for processing queries and maintaining data in cache.
more efficient get the nuggets you are looking for
The power of the TwinFin system is our ability to work at "physics" speed. I/O limitations are eliminated as data is processed at disk speed. We designed an architecture that specifically meets the needs of enterprise-class BI. Unlike existing solutions that are an inefficient patchwork of systems (for example an Oracle DBMS, a Sun server and EMC storage), TwinFin is what's in the box: an appliance that integrates software, servers and storage into one easy-to-use solution. AMPP is what we are calling our breakthrough architecture: the best combination of SMP and MPP for performance and scalability. We connect to existing apps and tools through SQL, ODBC and JDBC. Data is loaded into the box using our fast loader. On the left side is the SMP host, a Linux OS running on a Compaq 4x; the host handles query plan optimization and table and set operations. On the right side is the MPP intelligent storage, for the lower-level operations that can be done blindingly fast using parallelization. Our basic unit on this side is the SPU (snippet processing unit), which consists of a disk, FPGA and microprocessor. As we add data or complex queries, the system grows by adding SPUs. We started with a clean sheet of paper and designed an architecture that specifically meets the needs of tera-scale BI.
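The split between the SMP host and the parallel storage units can be sketched in a few lines of Python. This is only an illustration of the scatter-and-consolidate idea, not Netezza code: the function names, the city names and the quantities are all invented, and a plain loop stands in for the units that would really run concurrently.

```python
from collections import Counter

# Hypothetical data: each inner list stands for the rows stored on one disk.
# Rows are (city, quantity) pairs; names and values are made up.
DATA_SLICES = [
    [("Recife", 3), ("Curitiba", 5)],
    [("Recife", 2), ("Manaus", 7)],
    [("Curitiba", 1), ("Manaus", 4), ("Recife", 6)],
]


def run_snippet(rows):
    """Work done next to one disk: a partial SUM(quantity) GROUP BY city."""
    partial = Counter()
    for city, quantity in rows:
        partial[city] += quantity
    return partial


def host_consolidate(partials):
    """Work done on the host: merge the partial results from every slice."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(sorted(total.items()))


def run_query(slices):
    # A real system would run the snippets in parallel; a loop keeps the sketch simple.
    return host_consolidate(run_snippet(rows) for rows in slices)


result = run_query(DATA_SLICES)
print(result)
```

Adding another slice to `DATA_SLICES` adds another independent `run_snippet` call, which loosely mirrors how the notes describe the system growing by adding SPUs.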
A key component of Netezza's performance is the way in which its streaming architecture processes data. The Netezza architecture uniquely uses the FPGA as a turbocharger: a huge performance accelerator that not only allows the system to keep up with the data stream, but actually accelerates the data stream through compression before processing it at line rates, ensuring no bottlenecks in the I/O path. You can think of the way that data streaming works in the Netezza as similar to an assembly line. The Netezza assembly line has various stages in the FPGA and CPU cores. Each of these stages, along with the disk and network, operates concurrently, processing different chunks of the data stream at any given point in time. The concurrency within each data stream further increases performance relative to other architectures.
Compressed data gets streamed from disk onto the assembly line at the fastest rate that the physics of the disk allows. The data could also be cached, in which case it gets served right from memory instead of disk. The first stage in the assembly line, the Compress Engine within the FPGA core, picks up the data block and uncompresses it at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse: the disk. The disk block is then passed on to the Project engine or stage, which filters out columns based on parameters specified in the SELECT clause of the SQL query being processed. The assembly line then moves the data block to the Restrict engine, which strips off rows that are not necessary to process the query, based on restrictions specified in the WHERE clause. The Visibility engine also feeds additional parameters to the Restrict engine, to filter out rows that should not be "seen" by a query, e.g. rows belonging to a transaction that is not committed yet. The Visibility engine is critical in maintaining ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds in the Netezza.
The processor core picks up the uncompressed, filtered data block and performs fundamental database operations such as sorts, joins and aggregations on it. It also applies complex algorithms that are embedded in the snippet code for advanced analytics processing. It finally assembles all the intermediate results together from the entire data stream and produces a result for the snippet. The result is then sent over the network fabric to other S-Blades or the host, as directed by the snippet code.
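The assembly line described in these notes can be imitated with Python generators, using the VENDAS query from slide 25 as the workload. This is a rough software analogy under stated assumptions, not how the FPGA is programmed: `zlib` and JSON stand in for the on-disk block format, the `txn` field and the `committed` set are an invented stand-in for transaction visibility, `VENDEDOR` is an invented extra column, and the sample rows are made up.

```python
import json
import zlib
from collections import defaultdict

# Columns the query touches, plus the hypothetical transaction marker.
NEEDED = ("CIDADE", "PRODUTO", "QUANTIDADE", "MES", "SEGMENTO", "CATEGORIA", "txn")


def make_block(rows):
    """Compress a list of row dicts; stands in for a compressed disk block."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def decompress(blocks):
    """Compress Engine stage: expand each block back into rows."""
    for block in blocks:
        yield from json.loads(zlib.decompress(block).decode("utf-8"))


def project(rows, columns=NEEDED):
    """Project stage: drop every column the query does not need."""
    for row in rows:
        yield {name: row[name] for name in columns}


def restrict(rows, committed):
    """Restrict + Visibility stages: keep committed rows matching the WHERE clause."""
    for row in rows:
        visible = row["txn"] in committed
        wanted = (row["MES"] == "20091201" and row["SEGMENTO"] == 509123
                  and row["CATEGORIA"] == "GASTRO")
        if visible and wanted:
            yield row


def aggregate(rows):
    """CPU stage: sum(QUANTIDADE) group by CIDADE, PRODUTO order by PRODUTO."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["PRODUTO"], row["CIDADE"])] += row["QUANTIDADE"]
    return [(cidade, produto, total)
            for (produto, cidade), total in sorted(totals.items())]


def sale(cidade, produto, quantidade, mes="20091201", txn=1):
    """Build one made-up VENDAS row, including a column the query never uses."""
    return {"CIDADE": cidade, "PRODUTO": produto, "QUANTIDADE": quantidade,
            "MES": mes, "SEGMENTO": 509123, "CATEGORIA": "GASTRO",
            "VENDEDOR": "n/a", "txn": txn}


blocks = [
    make_block([sale("Recife", "A", 2),
                sale("Recife", "A", 3, txn=2),            # uncommitted: hidden
                sale("Manaus", "B", 4, mes="20091101")]),  # wrong month: dropped
    make_block([sale("Recife", "A", 5), sale("Manaus", "B", 1)]),
]

result = aggregate(restrict(project(decompress(blocks)), committed={1}))
print(result)
```

Because each stage is a generator, a row flows through all stages before the next block is expanded, which is the streaming behaviour the notes describe; only the filtered rows ever reach `aggregate`.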
Disk failover and resiliency are highly improved on TwinFin. Each disk is divided into 3 partitions: one that holds a slice of the user's data, one that holds a mirror of the data on another disk, and a temp partition used to hold intermediate results. All of these partitions are mirrored, including the temp partition. The primary partition is mirrored in pairs in a RAID 1 format. The temp partition is laid out across a set of 8 drives in RAID 1+0 format (striped on mirrors).
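A small sketch can make the pair-wise mirroring concrete. It models only the primary and mirror partitions; the pairing rule (disk 0 with 1, 2 with 3, and so on), the slice names and the helper functions are assumptions made for illustration, and the RAID 1+0 layout of the temp partition is not modeled.

```python
def partner(disk):
    """Assumed pairing rule: disks are paired as (0, 1), (2, 3), ..."""
    return disk ^ 1


def build_layout(num_disks):
    """Map each disk to its own data slice and the mirrored slice of its partner."""
    if num_disks % 2:
        raise ValueError("pair-wise mirroring needs an even number of disks")
    return {
        disk: {
            "primary": f"slice-{disk}",
            "mirror": f"slice-{partner(disk)}",
            "temp": f"temp-{disk}",
        }
        for disk in range(num_disks)
    }


def readable_slices(layout, failed):
    """Data slices still readable when the disks in `failed` are lost."""
    available = set()
    for disk, partitions in layout.items():
        if disk not in failed:
            available.add(partitions["primary"])
            available.add(partitions["mirror"])
    return available


layout = build_layout(8)
all_slices = {f"slice-{d}" for d in range(8)}
print(readable_slices(layout, failed={3}) == all_slices)
print(sorted(all_slices - readable_slices(layout, failed={2, 3})))
```

Under this assumed pairing, losing one disk leaves every slice readable from its partner, while losing both disks of a pair makes their two slices unavailable.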
All drives are visible to all S-Blades within a chassis
Instead of spending time and effort on tedious DBA tasks, use the time for higher BUSINESS VALUE tasks: bring on new applications and groups, quickly build out new data marts, and provide more functionality to your end users.
In order to compete effectively in an increasingly complex marketplace, companies continually seek out ways to gain more foresight from their ever increasing data assets and deliver more intelligence at every decision and interaction point. Advanced analytics has long held the promise of doing just that – moving BI from reporting on historical information to predicting future outcomes and ultimately helping employees and partners pick the best alternative from a series of confusing choices for every decision point. Whether the question is a common one such as segmenting customers based on their purchase behavior and demographics, or a more complex one such as optimization of the company’s demand chain, advanced analytics promises to deliver the right results. Despite the obvious benefits, most organizations find it very challenging to deploy advanced analytics across their enterprise and ask complex questions of their growing data volumes … to create a truly Analytic Enterprise.
Let’s look at how analytics has traditionally been done in most organizations. Note: The analytics process can be thought of as 2 distinct activities – modeling and prediction. Modeling is the process of mining historical data to identify patterns and relationships of interest. Once those patterns and relationships are identified, modelers/quants build a mathematical model that describes a particular behavior such as propensity to buy or churn, potential fraud, etc. Prediction is the process of applying the model to data to predict the event described by the model. Modelers and statisticians generally have their own workstations or servers where they perform the modeling tasks. This requires them to move large amounts of data out of the data warehouse onto these systems. The data is cleansed, pre-processed and transformed to fit the needs of the server and analytics tools they are using. These computers are severely limited in how much data they can handle, frequently forcing modeling to be done only on a small sample instead of all the historical data available. Testing the model requires more data to be pulled out from the DW, further slowing down processing. This results in lots of IO and data getting distributed in lots of silos across the organization. Once a model is built, it is deployed on an analytics server, typically a large SMP server or grid that once again pulls data off the DW to run the predictive model on. The prediction is done off-line and data loaded back onto the DW. This is again a slow and cumbersome process, requiring the maintenance of additional infrastructure, while the off-line process often leads to processing of stale data.
In practically all our discussions with customers, we find that the challenges faced by organizations when deploying advanced analytics more broadly are not very different from what they faced with their data warehouse deployments prior to Netezza: The analytics process is expensive and time-consuming; it generally takes weeks to develop a predictive model from the data in the DW. Once the model is developed, it still takes hours or even days in some cases to execute on all the data despite throwing expensive hardware at the problem. The problem is further exacerbated by growing data volumes. The analytics process is extremely inefficient, with the majority of the quants team's time spent on non-value-added activities (simply moving data around in most cases) rather than data mining and model building. The end result is loss of productivity of expensive resources and dissatisfaction among the business users. The movement of data also leads to analytics silos, where old and inaccurate data sits in various places across the organization. Most analytics is done on a subset of the data that organizations have spent lots of time and money collecting and storing. Processing limitations in the underlying infrastructure force quants teams to either avoid asking complex questions or spend time decomposing the problem into pieces that the systems can handle. Analysts are unable to experiment and perform deeper and broader analysis to build models that reflect the realities of the complex world. The end result is that despite applying time and resources, companies are unable to fully exploit their data resources and ultimately risk being left behind in the competitive arena.
Netezza has created an extremely flexible analytics platform that offers orders-of-magnitude performance gains at petascale. The integrated, easy-to-use appliance dramatically accelerates the entire analytics process. The programming interfaces and parallelization primitives offered make it straightforward to move a majority of analytics inside the appliance, regardless of whether they are being performed in SAS and R or written in Java, Python or Fortran. By bringing analytics to the data, modelers and quants teams can operate on the data directly inside the appliance instead of having to move it to a different location and deal with the associated data pre-processing and transformation. More importantly, modelers can take full advantage of the MPP architecture to ask the most complex questions on all the enterprise data, without the infrastructure getting in the way. They can iterate through different models more quickly to find the best fit. Administrators can easily create marts and sandboxes inside the appliance to allow modelers to work on their analytics problems without disrupting normal business operations. Data stays within a central repository instead of getting distributed all over the organization. Once the model is developed, it is seamless to put it into prediction mode. The prediction and scoring can be done right where the data resides, inline with other processing, on an as-needed basis. Users can get the results of prediction scores in near real-time, helping operationalize advanced analytics and making it available throughout the enterprise.
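The idea of scoring "right where the data resides" can be illustrated with Python's built-in `sqlite3` module, which lets a Python function be registered and called from SQL. SQLite is only a stand-in here and says nothing about Netezza's actual programming interfaces; the `churn_score` model, its coefficients, the `customers` table and its rows are all invented for the sketch.

```python
import math
import sqlite3


def churn_score(monthly_calls, complaints):
    """Hypothetical logistic model; the coefficients are invented for illustration."""
    z = -1.0 - 0.05 * monthly_calls + 0.8 * complaints
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")

# Register the model as a SQL function so scoring runs inside the query itself,
# instead of exporting rows to a separate analytics server first.
conn.create_function("churn_score", 2, churn_score)

conn.execute(
    "CREATE TABLE customers (id INTEGER, monthly_calls REAL, complaints INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 40.0, 0), (2, 5.0, 3), (3, 20.0, 1)],
)

# The prediction is applied inline with ordinary filtering and ordering.
high_risk = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM customers "
        "WHERE churn_score(monthly_calls, complaints) > 0.5 ORDER BY id"
    )
]
print(high_risk)
```

The point of the sketch is the shape of the workflow: the model travels to the data as a function, and the query returns only the scored result rather than the full table.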