Data Analysis Using IBM Netezza Analytics
 

Grzegorz Puchawski


Usage Rights

© All Rights Reserved


    Data Analysis Using IBM Netezza Analytics: Presentation Transcript

    • June 14, 2011, Warsaw, Sheraton Warsaw Hotel. Grzegorz Puchawski: Data analysis within IBM Netezza Analytics
    • In a nutshell, what is IBM Netezza Analytics?
    • Big Data Meets Big Math: Analytics Without Constraints
    • Massive Data and Massive Computation
      • Data Intensity: Depth of Data, Width of Data
      • Computational Intensity: Computational Complexity, Model Complexity
    • IBM Netezza Analytics components
      • In-Database Analytics
        – nzAnalytics: Data Prep, Data Mining, Predictive Analytics
        – Open Source Analytics: Spatial, R Analytics
        – Custom: Customer/Partner Analytics
      • Parallel Analytic Engines: nzEngine for Hadoop, nzMatrix, nzEngine for R
      • Software Development Kit: nzAdaptors for C, C++, Java, Python, Fortran; nzPlug-in for Eclipse; nzPackage for R GUI
      • Streaming Accelerator
      • IBM Netezza AMPP™ Platform
    • Who is the target audience for IBM Netezza Analytics?
    • Who is the target audience for IBM Netezza Analytics?
      • Line of Business Owner
        – Areas of interest: gaining / sustaining competitive advantage, discovering new opportunities to increase revenue or decrease costs, ability to use all the data collected
        – Benefits: fast results, significant added business value / big bets, leverage of all the data, performance at scale
      • Business Intelligence
        – Areas of interest: analysis beyond SQL, analytics dashboards and reports
        – Benefits: rich set of analytics beyond SQL
      • Data Miners
        – Areas of interest: marketing, life sciences, fraud, network analysis
        – Benefits: ability to explore more data, quick to failure, identification of new opportunities, new package of analytic tools, ability to process large data
    • Who is the target audience for IBM Netezza Analytics? (continued)
      • Modelers
        – Areas of interest: logistics, yield, forecasting, risk
        – Benefits: simplification of analytic processes, ability to use new and innovative models, quick to failure, model at scale using parallelized analytics, score at scale
      • Quants / Statisticians
        – Areas of interest: risk, forecasting, descriptive statistics, correlation of factors
        – Benefits: simplification of analytic processes, quick to failure, in-database analytics
      • Programmers, Developers
        – Areas of interest: low-level programming tools, multi-language environment, User Defined Functions (UDFs), User Defined Analytic Processes (AEs), Eclipse
        – Benefits: power and simplification of in-database analytics, flexibility of porting analytics/applications
    • How is the IBM Netezza Analytics platform used?
    • High Performance on Massive Data
      • 1. Exploratory Data Analysis: Data Exploration, Data Cleansing, Data Transformation
      • 2. Build Model: Descriptive Modeling, Predictive Modeling, Optimization Model
      • 3. Deploy Model: Scoring, Forecasting, Decision Management
      • 4. Embed Algorithms: Embarrassingly Parallel Algorithms, Heroic Computations, Model Parallelism
    • Technology used at each stage
      • Embed Algorithms
        – User Interface: Eclipse, R GUI/CLI
        – Development Env.: UDF, UDAP, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, nzPackage for R
      • Exploratory Data Analysis
        – User Interface: SQL, R GUI/CLI
        – Analytics: nzAnalytics, R Analytics, Customer Analytics, Partner Analytics
      • Build Model
        – User Interface: SQL, R GUI/CLI, Eclipse
        – Analytics: nzAnalytics, R Analytics, Customer Analytics, Partner Analytics
      • Deploy / Score Model
        – Analytics: nzAnalytics, R Analytics, Customer Analytics, Partner Analytics
        – Deploy/Scoring: nzAdaptors, UDF, UDAP, Shared Library, Stored Procedures, R
    • Embedding Algorithms
      • What is it?
        – The ability to run programs directly on the S-Blade
      • What is it used for?
        – Bringing complex computation to the Netezza data stream
      • What technology does it use?
        – User interface: Eclipse, R GUI/CLI
        – Development environment: UDFs, User Defined Analytic Process, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, R packages (for implementing algorithms run from the R GUI)
      • What are the benefits?
        – Ability to process data as it streams, directly on the S-Blade
        – Ability to harness the total compute power of a TwinFin for parallel processing
    • Exploratory Data Analysis
      • What is it?
        – The exercise of looking at data for the purpose of coming up with hypotheses
      • What is it used for?
        – Exploratory data analysis: data profiling / descriptive statistics, general diagnostic measures, statistics, sampling, histograms
        – Data cleansing: feature selection
        – Data transformation: data prep / transformations
      • What technology does it use?
        – User interface: SQL, R GUI/CLI, others
        – Analytics: in-database analytics, R analytics, customer/partner analytics
      • What are the benefits?
        – Discovery on more data, faster
    • Build Model
      • What is it?
        – Choosing which method will give the best results
        – Finding the best parameters to give the best predictions
      • What is it used for?
        – Predictive analytics: regression, classification, Bayesian networks, model testing, sample size
        – Data mining: association rules mining, clustering, feature selection
      • What technology does it use?
        – User interface: SQL, R GUI/CLI, Eclipse, ...
        – Analytics: in-database analytics, R analytics, customer/partner analytics
        – Development tools: language adapters, UDFs, UDAP, Stored Procedures
      • What are the benefits?
        – Moving the computation to the data
        – Parallel computation on all of the data
    • Deploying / Scoring Model
      • What is it?
        – Parallelized application of a model using parameters from the build step
      • What is it used for?
        – Predictive analytics: regression, classification, Bayesian networks, model testing
        – Data mining: association rules mining, clustering, feature extraction
      • What technology does it use?
        – Analytics: in-database analytics, R analytics, customer/partner analytics
        – Deploying/scoring: language adapters, UDFs, User Defined Analytic Process, Shared Libraries, Stored Procedures, R package
        – Development tools: language adapters, UDFs, AE, Stored Procedures
      • What are the benefits?
        – Score and experiment in parallel
        – Faster model scoring and therefore faster time to insight/value
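The build-then-score split described on this slide can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the model is a one-variable least-squares line, the "data slices" are plain Python lists, and a thread pool stands in for the appliance's parallel workers. None of the names belong to a Netezza API.

```python
from concurrent.futures import ThreadPoolExecutor


def build_model(rows):
    """Build step: fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def score_slice(model, xs):
    """Score step for one slice: apply the fixed model parameters to each row."""
    return [model["slope"] * x + model["intercept"] for x in xs]


def score_in_parallel(model, slices):
    """Score every slice independently; threads are a stand-in for parallel nodes."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        return list(pool.map(lambda xs: score_slice(model, xs), slices))


# Hypothetical training data lying exactly on y = 2x + 1.
training = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
model = build_model(training)

# Hypothetical slices of new rows to score.
slices = [[5.0, 6.0], [7.0], [8.0, 9.0, 10.0]]
scores = score_in_parallel(model, slices)
```

Because scoring only reads the parameters produced by the build step, each slice can be processed without coordinating with the others, which is what makes this stage easy to parallelize.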
    • What is in IBM Netezza Analytics?
    • Streaming Accelerator (on the Netezza AMPP™ Platform)
      • What is it?
        – Our unique differentiator that combines our historical strength in fast data stream processing with powerful in-database analytics processing and new inter-node analytics processing capabilities
      • What is it used for?
        – Parallelizing data and analytics processing
      • What technology does it use?
        – FPGA
        – UDFs, User Defined Analytic Process
        – Message Passing Interface (MPI) for distributed processing
      • What are the benefits?
        – Accelerates data processing for analytics
        – Accelerates parallel matrix operations on big data
        – Simplifies parallelization
    • Parallel Analytic Engines: IBM Netezza Matrix Engine (nzMatrix)
      • What is it?
        – Parallelized linear algebra package
      • What is it used for?
        – Building block for higher-order parallelized analytics
      • What technology does it use?
        – Scalable Linear Algebra Package (ScaLAPACK)
        – Message Passing Interface (MPI) for distributed processing
      • What are the benefits?
        – Simplifies analytic algorithm and model development
        – Accelerates parallel matrix operations on big data
    • Parallel Analytic Engines: IBM Netezza Matrix Engine (nzMatrix)
      • Supports the following parallel matrix operations
        – Basic Linear Algebra Subroutines (e.g. matrix multiplication, matrix dot function)
        – Solving a system of linear equations
        – Solving linear least squares problems
        – Eigenvalues and eigenvectors
        – Singular Value Decomposition (SVD)
        – Matrix factorization
        – Matrix inversion
        – Matrix element scalar functions
        – Matrix reduction functions (e.g. min, max, sum of squares, sum)
        – Matrix inquiry functions (e.g. number of rows and columns)
        – Matrix reshaping functions
      • Call interface
        – Accessible from R, Python, Java, etc. via ODBC and Stored Procedures
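To give a feel for how a matrix multiplication can be split into independent pieces of work, here is a minimal pure-Python sketch of a blocked product. It is not the nzMatrix or ScaLAPACK interface: the function names are hypothetical, threads stand in for parallel nodes, and the matrices are assumed to be square with a side length divisible by the block size.

```python
from concurrent.futures import ThreadPoolExecutor


def block(matrix, row0, col0, size):
    """Cut a size x size tile out of a matrix stored as a list of rows."""
    return [row[col0:col0 + size] for row in matrix[row0:row0 + size]]


def multiply(a, b):
    """Plain dense product of two small matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def blocked_product(a, b, size, workers=4):
    """Compute a @ b tile by tile; each output tile is an independent task."""
    n = len(a)  # assumption: a and b are n x n and n is divisible by size
    steps = range(0, n, size)

    def output_tile(origin):
        i, j = origin
        acc = [[0] * size for _ in range(size)]
        for k in steps:
            acc = add(acc, multiply(block(a, i, k, size), block(b, k, j, size)))
        return i, j, acc

    result = [[0] * n for _ in range(n)]
    origins = [(i, j) for i in steps for j in steps]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, j, tile in pool.map(output_tile, origins):
            for r, row in enumerate(tile):
                result[i + r][j:j + size] = row
    return result


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
C = blocked_product(A, B, size=2)
```

Each output tile depends only on one row of tiles from the left operand and one column of tiles from the right operand, so the tiles can be computed on different workers and assembled at the end.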
    • Parallel Analytic Engines: IBM Netezza Engine for Hadoop
      • What is it?
        – Hadoop-compatible implementation of the MapReduce paradigm
      • What is it used for?
        – Clickstream and social data analysis
        – ETL/ELT and analytics processing of key/value pairs
      • What technology does it use?
        – Java User Defined Analytic Process
      • What are the benefits?
        – Enables effective parallel processing of data from Netezza database tables
        – Brings Hadoop to the database with minimal refactoring of existing Hadoop code
        – Only database offering a Hadoop interface (all others are home-grown)
    • Hadoop by Apache vs Hadoop by Netezza
      [Diagram: Mappers 1 to 4 feed Reducers through a redistribution step. In Apache Hadoop the input and output live in HDFS on cluster nodes; in Netezza the input table and output table are data slices on SPUs.]
    • Parallel Analytic Engines: Netezza Engine for Hadoop, example
      • Example: clickstream analysis
        – Data: a table containing data about users and visited pages; user groups' definitions
        – Task: for each group, find all pages that have been visited by all members of this group
    • Parallel Analytic Engines: Netezza Engine for Hadoop, example
      • Sample data: clickstream analysis
        – Visits (USER, URL): (A, ibm.com), (A, netezza.com), (A, sheraton.pl), (B, ibm.com), (D, netezza.com), (D, apache.org)
        – Groups (GROUP, USER): (FIRST, A), (FIRST, B), (SECOND, A), (SECOND, D)
        – Result (GROUP, URL): (FIRST, ibm.com), (SECOND, netezza.com)
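The clickstream task can be sketched as a map, shuffle and reduce pipeline in plain Python. This is an illustration of the MapReduce idea only, not code for the Netezza Engine for Hadoop or the Hadoop API; the function names are hypothetical, and the sample rows are the visits and group memberships as read from the slide.

```python
from collections import defaultdict

# (user, url) visits and (group, user) memberships, as read from the slide.
VISITS = [
    ("A", "ibm.com"), ("A", "netezza.com"), ("A", "sheraton.pl"),
    ("B", "ibm.com"), ("D", "netezza.com"), ("D", "apache.org"),
]
GROUPS = [("FIRST", "A"), ("FIRST", "B"), ("SECOND", "A"), ("SECOND", "D")]


def map_phase(visits, groups):
    """Emit one (group, (user, url)) pair for every visit by a group member."""
    groups_of_user = defaultdict(set)
    for group, user in groups:
        groups_of_user[user].add(group)
    for user, url in visits:
        for group in groups_of_user.get(user, ()):
            yield group, (user, url)


def shuffle(pairs):
    """Collect all values emitted under the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values, group_size):
    """Keep the URLs whose distinct visitors cover the whole group."""
    visitors = defaultdict(set)
    for user, url in values:
        visitors[url].add(user)
    return {url for url, users in visitors.items() if len(users) == group_size}


def pages_visited_by_all(visits, groups):
    members = defaultdict(set)
    for group, user in groups:
        members[group].add(user)
    shuffled = shuffle(map_phase(visits, groups))
    return {g: reduce_phase(vals, len(members[g])) for g, vals in shuffled.items()}


result = pages_visited_by_all(VISITS, GROUPS)
```

With the sample rows, `result` maps `FIRST` to `{"ibm.com"}` and `SECOND` to `{"netezza.com"}`, which matches the result table on the slide.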
    • Parallel Analytic Engines: IBM Netezza Engine for R
      • What is it?
        – Native R pushed down onto the S-Blade for parallel analytics processing
      • What is it used for?
        – Exploratory data analysis, building models, scoring models, etc.
      • What technology does it use?
        – Open Source R
        – User Defined Analytic Process, data stream processing
      • What are the benefits?
        – Accelerates and scales R to run on big data
        – Leverages the open-source CRAN repository of algorithms
      • Supports the following parallel R operations
        – R interpreter running in parallel
        – R CRAN analytics applied in parallel
      • Call interface
        – Invoked via SQL (in the manner of a User Defined Analytic Process) or from R
    • In-Database Analytics: nzAnalytics
      • What is it?
        – Parallelized in-database analytics for data prep, data mining, prediction, and geospatial analysis
      • What is it used for?
        – Building and deploying/scoring models
      • What technology does it use?
        – UDFs, Stored Procedures, User Defined Analytic Process, nzMatrix
      • What are the benefits?
        – Starter kit of parallelized analytics designed for a parallel environment and for large-scale data
    • In-Database Analytics: nzAnalytics, Data Prep
      • Data Profiling / Descriptive Statistics
        – Probability density and inverse functions: Normal, Fisher, Exponential, Uniform, Weibull, Wilcoxon, Mann-Whitney, Student's t, Chi-Square
        – General diagnostic measures, error calculation: Classification Error, Mean Absolute Error, Mean Squared Error, Relative Absolute Error, Relative Squared Error
      • Statistics
        – Histogram and Frequency Table Count: Histogram, Bivariate Frequency Table, Univariate Frequency Table
        – Quantiles: Quantiles, Median, Outliers, Quartile
        – Parametric statistics: Chi-Square, Student's t
        – Non-parametric statistics: Spearman's Rank Correlation, Mann-Whitney-Wilcoxon, Wilcoxon
        – Moments: Kurtosis, Skewness
      • Sampling
        – Uniform random sampling: Uniform Random Sampling, Uniform Random Sampling Fraction
      • Data Prep / Transformations
        – Binning and discretization: Entropy Minimization, Equal Width, Equal Frequency
        – Standardization and normalization: Standardization and Normalization
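The five error measures named on this slide have common textbook definitions, sketched below in plain Python. This is an assumption about what the names mean; the slide does not give formulas, and the exact definitions used by nzAnalytics may differ. The relative measures compare the model's error with the error of always predicting the mean of the actual values.

```python
def classification_error(actual, predicted):
    """Fraction of labels that were predicted incorrectly."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_absolute_error(actual, predicted):
    """Total absolute error divided by that of the predict-the-mean baseline."""
    mean = sum(actual) / len(actual)
    baseline = sum(abs(a - mean) for a in actual)
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / baseline


def relative_squared_error(actual, predicted):
    """Total squared error divided by that of the predict-the-mean baseline."""
    mean = sum(actual) / len(actual)
    baseline = sum((a - mean) ** 2 for a in actual)
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / baseline


# Hypothetical values chosen so the arithmetic is easy to follow by hand.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
metrics = {
    "mae": mean_absolute_error(actual, predicted),
    "mse": mean_squared_error(actual, predicted),
    "rae": relative_absolute_error(actual, predicted),
    "rse": relative_squared_error(actual, predicted),
}
label_error = classification_error(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```

A relative error below 1 means the model beats the mean baseline; a value above 1 means it does worse.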
    • In-Database Analytics: nzAnalytics, Data Mining and Predictive Analytics
      • Data Mining
        – Association rules mining: Association (FP-Growth)
        – Clustering: K-Means; Hierarchical Clustering (Divisive Clustering, Agglomerative Clustering)
        – Feature extraction: Dimension Reduction (Principal Components Analysis)
      • Predictive Analytics
        – Sample size: One-Way ANOVA (Complete Randomized Design, Randomized Block Design)
        – Model testing: Error Calculation (Cross Validation, Percentage Split, Train / Test)
        – Bayesian methods: Classifier (Naïve Bayes); Graphical Model (Bayesian Networks)
        – Regression: Linear Regression, Generalized Linear Models
        – Classification: Decision Trees (Entropy Decision Tree, Gini Index Decision Tree, Regression Tree); Neighborhood Methods (K Nearest Neighbors)
    • What are these data mining algorithms used for?
      • Association Rules Mining: find co-occurring items in a market basket
        – Suggest product combinations
        – Design better item placement on shelves
      • Clustering: find naturally occurring groups
        – Market segmentation
        – Find disease subgroups
        – Distinguish normal from non-normal behavior
      • Feature Extraction: identify the most influential attributes for a target attribute
        – Factors associated with high costs, responding to an offer, etc.
        [Figure: attributes A1 to A8]
    • What are these data mining algorithms used for?
      • Regression: predict a numeric value
        – Predict a purchase amount or cost
        – Predict the value of a home
      • Classification: predict customers most likely to
        – Respond to a campaign or offer
        – Incur the highest costs
      • Target your best customers
      • Develop customer profiles
    • Association Rules Mining Example: find co-occurring items in a market basket
      • Regular database
        – # transactions = 71M
        – # items = 250k
        – Implementation in SQL
        – Offline process
        – Computation time around 5 hours
      • IBM Netezza Analytics
        – In-database analytics using the FP-Growth algorithm
        – Ability to run on-demand analysis
        – Support 1% (708 208 transactions): time 1m, 87 itemsets
        – Support 0.1% (70 828 transactions): time 16m, 4000 itemsets
        – Support 0.01% (7 082 transactions): time 41m, 5 583 391 itemsets
        – Support 0.001% (708 transactions): time 51m, 346 749 521 itemsets
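The table on this slide shows the number of frequent itemsets growing sharply as the support threshold drops. The sketch below reproduces that effect on a toy dataset. It is a brute-force counter, deliberately not the FP-Growth algorithm named on the slide, and the baskets and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those meeting min_support.

    min_support is a fraction of the number of transactions, so the absolute
    threshold is min_support * len(transactions).
    """
    threshold = min_support * len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= threshold}


# Hypothetical market baskets.
BASKETS = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

strict = frequent_itemsets(BASKETS, min_support=0.75)  # needs 3 of 4 baskets
loose = frequent_itemsets(BASKETS, min_support=0.50)   # needs 2 of 4 baskets
```

At 75% support only the three single items qualify; at 50% the three pairs qualify as well. Brute-force counting like this enumerates every candidate itemset, which is why pattern-growth methods such as FP-Growth are preferred on large data.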
    • In-Database Analytics: IBM Netezza Spatial Engine (open source)
      • What is it?
        – Location intelligence extension for the IBM Netezza TwinFin appliance
      • What is it used for?
        – Processing queries about geographical data to perform spatial analysis
      • What technology does it use?
        – GGL, GEOS libraries
      • What are the benefits?
        – A set of functions for running GIS analysis on large volumes of data
        – Analysis of spatial information entirely in the database
        – Better and faster analysis using spatial data
    • In-Database Analytics: Spatial Concepts (open source)
      • Goal: to process queries about geometric features or geographical data in order to perform various types of analysis
      • Examples of geographical data
        – The location of a store, a wireless service tower or another landmark
        – A running feature such as a street, river or power line
      • Examples of spatial analysis
        – Identify the number of wireless calls that occur in a particular area so that you can better plan the addition of new towers to improve wireless service
        – Calculate the driving distance from a certain point to the nearest N fire stations to calculate the cost of an insurance premium
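The fire-station example can be sketched as a nearest-N query. The sketch below makes two loud assumptions: it uses great-circle (haversine) distance as a stand-in for driving distance, which would need a road network, and the coordinates and station names are made up. It does not use the Netezza Spatial functions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_stations(point, stations, n):
    """Return the n closest stations as (name, distance_km), nearest first."""
    ranked = sorted(stations.items(), key=lambda item: haversine_km(point, item[1]))
    return [(name, haversine_km(point, coords)) for name, coords in ranked[:n]]


# Hypothetical insured property and fire stations.
HOME = (52.0, 21.0)
STATIONS = {
    "north": (52.1, 21.0),
    "east": (52.0, 21.5),
    "far": (53.0, 22.0),
}

closest = nearest_stations(HOME, STATIONS, n=2)
```

In a database setting the distance function would be applied to every row in parallel and the ranking done with an ordered, limited query; the Python version only shows the shape of the calculation.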
    • Examples of Usage
      – Area
      – Distance
      – Length
      – Perimeter
      • Because IBM Netezza Spatial functions are implemented as UDFs, they can use the full potential of Netezza's Massively Parallel Processing architecture
    • Software Development Kit: nzAdaptors for C, C++, Java, Python, Fortran
      • What is it?
        – APIs that allow in-database user defined functions to be written in various languages
      • What is it used for?
        – Enabling any program to run on the S-Blades (with minimal refactoring)
      • What technology does it use?
        – User Defined Analytic Process
      • What are the benefits?
        – Flexibility to build and deploy analytics/models in multiple languages
        – Eliminates the need to rewrite and revalidate model scoring code
        – Analytics can be written in a different language than the calling application
      • Supports the following parallel operations
        – Parallel execution of the analytic, model, application
      • Call interface
        – Language-specific API
        – Invoked via SQL
    • Software Development Kit: nzPackage for R
      • What is it?
        – R packages that integrate the R GUI/CLI with Netezza
        – Provide interfaces to tables, matrices, apply operations, and nzAnalytics
      • What is it used for?
        – Data frame integration with the data warehouse, pushing analytics processing to the S-Blades, scoring on S-Blades, installation of R packages, integration with SQL, matrix integration
      • What technology does it use?
        – R API for creating packages, open-source CRAN packages (e.g., RODBC)
      • What are the benefits?
        – Ability to use S-Blades for scaling R analytics/models
        – Large-scale linear algebra via Matrix
        – Access to nzAnalytics from R
    • Software Development Kit: nzPlug-in for Eclipse
      • What is it?
        – A plug-in for Eclipse that facilitates easier development of UDFs and Stored Procedures
      • What is it used for?
        – UDF and Stored Procedure wizards
        – Remote SSH terminal, database object explorer, SQL editors, source code control, issue management, system monitoring, documentation builder
      • What technology does it use?
        – Eclipse
      • What are the benefits?
        – Faster, more targeted development
        – Leverages the many available open-source plug-ins for Eclipse
    • Software Development Kit: Netezza plug-in for Eclipse
      • What's included
        – Predefined project perspective
        – NZ Admin
        – NZ Cartridge Manager
        – Logs browser
        – Editors with syntax highlighting
        – Remote console and SSH terminals
        – Template wizards (NZ project, UDX, UDTF, Stored Procedures, Makefile, …)
        – Synchronization between local and remote projects
        – Data tools: Database Object Explorer, SQL Editors, Data Explorer, … (with support for the Netezza database)
    • What are the key points?
    • Key Points
      • Target audience
        – Line of Business, BI, Data Miners, Modelers, Quants, Statisticians, Programmers
      • IBM Netezza Analytics uses
        – Embedding algorithms, exploratory data analysis, building models, deploying/scoring models
      • 3 major components of IBM Netezza Analytics
        – Parallel Analytic Engines, SDK, In-Database Analytics
      • Streaming Accelerator
        – Unique differentiator combining data stream processing, in-database analytics processing and inter-node processing
    • Key Differentiators
      • Faster and scalable analytics processing
      • Parallelized in-database analytics
      • Large-scale matrix operations
      • Rich development environment
    • Key Benefits
      • Eliminates inefficient analytics data processing: data remains in place
      • Speeds up time to insight, action and business value
      • Achieves parallelism without parallel programming
      • Enables increased analytics experimentation
      • Protects and leverages investment in existing analytics
      • Reduces technology barriers for large-scale analytics
    • IBM Netezza Analytics: Big Data, Big Math
    • Thank you. Your Data. Your Site. Our Appliance.