Data Science Overview (Oct. 3rd, 2012)
Presentation Transcript
Data Science Meetup
Finalize Data Science Teams
Technology Overview
Data Science Tools
Data Science Resources
October 3, 2012
Presentation by:
Michael Walker
Rose Business Technologies
720.email@example.com
www.rosebt.com
Agenda
6:00 - 6:30 Overview - Finalize Data Science Teams: Michael Walker
6:30 - 7:00 Hadoop/MapReduce Presentation: John Dougherty
7:00 - 7:15 Qubole Presentation: Sadiq Shaik
7:15 - 7:45 Kognitio Presentation: Reggie Arizmendi
7:45 - 8:00 Network
Hype Cycle for Emerging Tech 2012
Hype Cycle for Big Data 2012
Top 5 Big Data Challenges
1. Deciding what data is relevant
2. Cost of technology infrastructure
3. Lack of skills to analyze the data
4. Lack of skills to manage big data projects
5. Lack of business support
Most Difficult Big Data Skills to Find
1. Advanced analytics, predictive analytics
2. Complex event processing
3. Rules management
4. Business intelligence tools
5. Data integration
Big Data Drivers
Analysis of:
1. Operational data
2. Online customer data
3. Sales transactions data
4. Machine or device data
5. Service innovation
Definitions
Big data analytics is the application of advanced analytic techniques to very big data sets.
Big data is a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis.
Horizontal & Vertical Applications
Big data technology can be deployed for business processes such as the following:
• Customer relationship management (sales, marketing, customer service)
• Supply chain and operations
• Administration (finance and accounting, human resources, legal)
• Research and development
• Information technology management
• Risk management
Horizontal & Vertical Applications
In addition, big data technology can be used for industry-specific applications such as the following:
• Logistics optimization in the transportation industry
• Price optimization in the retail industry
• Intellectual property management in the media and entertainment industry
• Natural resource exploration in the oil and gas industry
• Warranty management in the manufacturing industry
• Crime prevention and investigation in local law enforcement
• Predictive damage assessments in the insurance industry
• Fraud detection in the banking industry
• Patient treatment and fraud detection in the healthcare industry
Data Science Teams
Four (4) person teams
Optimal skill mix:
1. Business Leader (consumer)
2. Statistics
3. Data Modeler
4. IT
Data Science Use Case / Scenario
Each team selects a use case / scenario:
• Thesis
• Data sources
• Analytical tools / platforms
Use Case
Example: I suggest there is a correlation between size of government and economic growth.
Thesis: Bigger government = slower economic growth
Data sources: Open government statistics; Yahoo Finance; Bloomberg
Tool: Qubole on Amazon PaaS
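A thesis like the one above is usually tested first with a simple correlation coefficient. The following is a minimal sketch, assuming two equal-length series have already been extracted from the data sources; the function name `pearson_r` and the numbers are invented placeholders for illustration, not real statistics and not evidence for or against the thesis.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented placeholder values, NOT real statistics: they only exercise the function.
government_size = [1.0, 2.0, 3.0, 4.0]
economic_growth = [8.0, 6.0, 4.0, 2.0]

r = pearson_r(government_size, economic_growth)
print(f"Pearson r = {r:.2f}")
```

A value of r near -1 or +1 indicates a strong linear association; it does not by itself establish that one variable causes the other.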
Data Modeling
A data model is a plan for building a database.
To use a common analogy, the data model is equivalent to an architect's building plans.
Data Modeling
Three different types of data models:
1) Conceptual data models.
These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams, high-level conceptual models are often created as part of your initial requirements envisioning efforts, as they are used to explore the high-level static business structures and concepts. On traditional teams, conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs.
Data Modeling
2) Logical data models (LDMs).
LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects, although they often are on traditional projects (where they rarely seem to add much value in practice).
Data Modeling
3) Physical data models (PDMs).
PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects, and as a result the focus of this article is on physical modeling.
Models of Data
A framework to organize and analyze data.
Predictive, Descriptive, Prescriptive Analytics
There are three types of data analysis:
• Predictive (forecasting)
• Descriptive (business intelligence and data mining)
• Prescriptive (optimization and simulation)
Models of Data
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable future outcome of an event or the likelihood of a situation occurring.
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events.
Models of Data
Predictive Analytics
Three basic cornerstones of predictive analytics are:
• Predictive modeling
• Decision analysis and optimization
• Transaction profiling
An example of using predictive analytics is optimizing customer relationship management systems. Predictive analytics can help an organization analyze all of its customer data, thereby exposing patterns that predict customer behavior.
Models of Data
Predictive Analytics
Another example is an organization that offers multiple products: predictive analytics can help analyze customers' spending, usage and other behavior, leading to efficient cross-selling, or selling additional products to current customers.
This directly leads to higher profitability per customer and stronger customer relationships.
Models of Data
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight as to how to approach the future. Descriptive analytics looks at past performance and seeks to understand that performance by mining historical data to look for the reasons behind past success or failure.
Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of post-mortem analysis.
Models of Data
Descriptive Analytics
Descriptive models quantify relationships in data in a way that is often used to classify customers or prospects into groups. Unlike predictive models that focus on predicting a single customer behavior (such as credit risk), descriptive models identify many different relationships between customers or products.
Descriptive models do not rank-order customers by their likelihood of taking a particular action the way predictive models do.
Models of Data
Descriptive Analytics
Descriptive models can be used, for example, to categorize customers by their product preferences and life stage. Descriptive modeling tools can be utilized to develop further models that can simulate a large number of individualized agents and make predictions.
For example, descriptive analytics examines historical electricity usage data to help plan power needs and allow electric companies to set optimal prices.
Models of Data
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make predictions and then suggests decision options to take advantage of the predictions.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen.
Models of Data
Prescriptive Analytics
Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk and illustrate the implication of each decision option.
In practice, prescriptive analytics can continually and automatically process new data to improve prediction accuracy and provide better decision options.
Models of Data
Prescriptive Analytics
Prescriptive analytics combines data, business rules, and mathematical models. The data inputs to prescriptive analytics may come from multiple sources, internal (inside the organization) and external (e.g. social media). The data may be structured (numerical and categorical data) or unstructured (text, images, audio, and video data), including big data. Business rules define the business process and include constraints, preferences, policies, best practices, and boundaries. Mathematical models are techniques derived from mathematical sciences and related disciplines including applied statistics, machine learning, operations research, and natural language processing.
Models of Data
Prescriptive Analytics
For example, prescriptive analytics can benefit healthcare strategic planning. Operational and usage data can be combined with data on external factors, such as economic data, population demographic trends and population health trends, to plan more accurately for future capital investments such as new facilities and equipment utilization, and to understand the trade-offs between adding beds and expanding an existing facility versus building a new one.
Models of Data
Prescriptive Analytics
Another example is energy and utilities. Natural gas prices fluctuate dramatically depending upon supply, demand, econometrics, geopolitics, and weather conditions. Gas producers, transmission (pipeline) companies and utility firms have a keen interest in more accurately predicting gas prices so that they can lock in favorable terms while hedging downside risk.
Prescriptive analytics can accurately predict prices by modeling internal and external variables simultaneously, and can also provide decision options and show the impact of each decision option.
Analytical Technologies
Tools
• ERwin Data Modeler
• StrategyCompanion
• Talend
• Pentaho
• Hortonworks
• Metalab
• SAS
• SPSS
• PSPP
Open Data Sources
• Freebase
• Data Hub
• Numbrary
• Peter Skomoroch's Delicious Data
• InfoChimps
• Open Data Sites
• DBpedia
• theinfo.org
• Lending Club Statistics
• MAF/TIGER (US Census Geo) Database
• Reuters Corpora (RCV1, RCV2, TRC2)
• Open Street Map
• MusicBrainz
• Jigsaw
• Opentick
Open Data Sources
• Historical Data, Yahoo Finance
• Historical Foreign Exchange Data, Federal Reserve Bank of New York
• Graduate School of Business, Stanford University
• Proprietary Trading Articles & Resources
• Wilmott.com
• DefaultRisk.com, Credit Risk Modeling Resource: Papers, Books, Conferences, Jobs
• Forex Factory, Forums
• NBER Papers in Asset Pricing: Stocks, Bonds and Foreign Currency
• Financial Engineering Books, International Association of Financial Engineers
Open Data Sources
• Literacy, Gross Domestic Product, Income and Military Expenditures for 154 Countries
• Continent Codes for Countries
• Source: Various Wikipedia Articles
• Daily Precipitation, Min and Max Temperatures for Berkeley for the first 10 months of 2005
• Source: http://hurricane.ncdc.noaa.gov/dly/DLY
• Release Dates and Box Office Earnings for Top Movies
• Source: http://www.movieweb.com/movies/boxoffice/alltime.php
• See Also: http://imdb.com/Top/
• Bush-Kerry Election Results 2004
• US State Population, 2003 and 2004
• Source: http://www.factmonster.com/ipka/A0004986.html
• Information about Cars (1978-1979)
• Diabetes in Pima Indians
• Information about Diabetes data source: http://www.ics.uci.edu/~mlearn/MLRepository.html
• Updated world data with new variables
• Wine Recognition Data
• Information about Wine data source: http://www.ics.uci.edu/~mlearn/MLRepository.html
• Nutritional Information about Crackers source: http://www.math.csi.cuny.edu/st/Projects
• XML Plant Catalog source: http://www.w3schools.com/xml/
• US Wheat Production 1910-2004 source: http://usda.mannlib.cornell.edu/data-sets/crops/88008/
• Birthdays and Terms of US Senators source: Wikipedia
• Weight and Sleep Information of Various Animals
• Information about Sleep Data Set
• SQLite Album database
• Iron dataset
Eight Levels of Analytics
Statistical Analysis
Statistical Analysis answers the questions: Why is this happening? What opportunities am I missing?
Example: Banks can discover why an increasing number of customers are refinancing their homes.
Here we can begin to run some complex analytics, like frequency models and regression analysis. We can begin to look at why things are happening using the stored data and then begin to answer questions based on the data.
Forecasting
Forecasting answers the questions: What if these trends continue? How much is needed? When will it be needed?
Example: Retailers can predict how demand for individual products will vary from store to store.
Forecasting is one of the hottest markets, and hottest analytical applications, right now. It applies everywhere. In particular, forecasting demand helps supply just enough inventory, so you don't run out or have too much.
Predictive Modeling
Predictive Modeling answers the questions: What will happen next? How will it affect my business?
Example: Hotels and casinos can predict which VIP customers will be more interested in particular vacation packages.
If you have 10 million customers and want to do a marketing campaign, who's most likely to respond? How do you segment that group? And how do you determine who's most likely to leave your organization? Predictive modeling provides the answers.
Optimization
Optimization answers the questions: How do we do things better? What is the best decision for a complex problem?
Example: Given business priorities, resource constraints and available technology, determine the best way to optimize your IT platform to satisfy the needs of every user.
Optimization supports innovation. It takes your resources and needs into consideration and helps you find the best possible way to accomplish your goals.
Conceptual Modeling
Conceptual Modeling brings together the business and technology views to define the solution scope.
It is more than technical architecture or data context diagrams. Technical architecture and data context diagrams have their place, but the critical skill is the business view (vs. technical view) of the solution scope.
This is critical to engaging stakeholders and setting the stage for innovation.
Statistical Models
• Nonparametric Tests
• T-test
• ANOVA & MANOVA
• ANCOVA & MANCOVA
• Linear Regression
• Generalized Least Squares
• Ridge Regression
• Lasso
• Generalized Linear Models
• Mixed Effects Models
Statistical Models
• Logistic Regression
• Nonlinear Regression
• Discriminant Analysis
• Nearest Neighbor
• Factor & Principal Components Analysis
• Copula Models
• Cross-Validation
• Bayesian Statistics
• Monte Carlo, Classic Methods
• Markov Chain Monte Carlo
Statistical Models
Two simple yet powerful models:
• Generalized Linear Regression Model
• Random Forests
Suggestion: Keep it simple for the first use case.
Predictive Modeling Techniques
Predictive Modeling Techniques
Problems with some predictive modeling techniques. Note that most of these techniques have evolved over the last 10 years to the point where most drawbacks have been eliminated, making the updated tool far different from, and better than, its original version. Typically, the original, flawed versions are still widely used.
1. Linear regression. Relies on normality, homoscedasticity and other assumptions; does not capture highly non-linear, chaotic patterns. Prone to over-fitting. Parameters difficult to interpret. Very unstable when independent variables are highly correlated. Fixes: variable reduction, apply a transformation to your variables, use constrained regression (e.g. ridge or Lasso regression).
2. Traditional decision trees. Very large decision trees are very unstable, impossible to interpret, and prone to over-fitting. Fix: combine multiple small decision trees together instead of using a large decision tree.
3. Linear discriminant analysis. Used for supervised classification. Bad technique because it assumes that clusters do not overlap and are well separated by hyperplanes. In practice, they never are. Use density estimation techniques instead.
4. K-means clustering. Used for clustering; tends to produce circular clusters. Does not work well with data points that are not a mixture of Gaussian distributions.
Predictive Modeling Techniques
5. Neural networks. Difficult to interpret, unstable, subject to over-fitting.
6. Maximum likelihood estimation. Requires your data to fit a pre-specified probability distribution. Not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit for your data.
7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality. Fix: use (non-parametric) kernel density estimators with adaptive bandwidths.
8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are independent; if they are not, it will fail miserably. In the context of fraud or spam detection, variables (sometimes called rules) are highly correlated. Fix: group variables into independent clusters of variables (in each cluster, variables are highly correlated). Apply naive Bayes to the clusters. Or use data reduction techniques. Bad text mining techniques (e.g. basic "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results with many false positives and false negatives.
And remember to use sound cross-validation techniques when testing models!
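The instability of linear regression under highly correlated predictors (item 1), and the ridge fix suggested for it, can be seen in a small sketch. It assumes two predictors and no intercept, so the ridge normal equations (X'X + lam*I) b = X'y reduce to a 2x2 system solved in closed form; the function name `ridge_two_predictors` and the data are invented for illustration. Setting `lam=0.0` gives ordinary least squares.

```python
from math import hypot


def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam*I) b = X'y for two predictors and no intercept."""
    a = sum(v * v for v in x1) + lam          # X'X[0][0] + lam
    b = sum(u * v for u, v in zip(x1, x2))    # X'X[0][1] (and [1][0])
    d = sum(v * v for v in x2) + lam          # X'X[1][1] + lam
    g1 = sum(u * v for u, v in zip(x1, y))    # X'y[0]
    g2 = sum(u * v for u, v in zip(x2, y))    # X'y[1]
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular system; increase lam")
    # Closed-form inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    return ((d * g1 - b * g2) / det, (a * g2 - b * g1) / det)


# Invented data: x2 is almost a copy of x1, so the predictors are highly correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1]
y = [2.3, 3.7, 6.4, 7.5, 10.2]

ols = ridge_two_predictors(x1, x2, y, lam=0.0)    # ordinary least squares
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)  # penalized coefficients

print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
print("Coefficient norms: ", hypot(*ols), hypot(*ridge))
```

The penalty always shrinks the coefficient vector relative to least squares; the value `lam=1.0` is an arbitrary choice here and would normally be selected by cross-validation.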
Predictive Modeling Techniques
Poor cross-validation allows bad models to make the cut by over-estimating the true lift to be expected in future data, the true accuracy or the true ROI outside the training set. Good cross-validation consists of:
• splitting your training set into multiple subsets (test and control subsets),
• including different types of clients and more recent data in the control sets (than in your test sets),
• checking the quality of forecasted values on control sets,
• computing confidence intervals for individual errors (error defined e.g. as |true value minus forecasted value|) to make sure that the error is small enough AND not too volatile (it has small variance across all control sets).
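The splitting and error-checking steps above can be sketched as a plain k-fold loop. This is a minimal illustration, not the presenter's procedure: the held-out folds play the role of the control sets, the "model" is simply the mean of the training values, the fold assignment by index is an arbitrary choice, and the data are invented.

```python
from statistics import mean, pstdev


def k_fold_indices(n, k):
    """Assign observation i to fold i % k; return one index list per fold."""
    return [[i for i in range(n) if i % k == fold] for fold in range(k)]


def cross_validate_mean_model(values, k):
    """Return the mean absolute error on each held-out fold.

    The model is deliberately trivial: forecast every held-out value
    with the mean of the remaining (training) values.
    """
    fold_errors = []
    for held_out in k_fold_indices(len(values), k):
        held = set(held_out)
        train = [v for i, v in enumerate(values) if i not in held]
        forecast = mean(train)
        errors = [abs(values[i] - forecast) for i in held_out]
        fold_errors.append(mean(errors))
    return fold_errors


# Invented observations, for illustration only.
observations = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
fold_errors = cross_validate_mean_model(observations, k=3)

print("Error per held-out fold:", fold_errors)
print("Average error:", mean(fold_errors))
print("Spread of error across folds:", pstdev(fold_errors))
```

A model is acceptable in the sense of the last bullet only if both the average error and its spread across folds are small.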
Statistical Software
Almost all serious statistical analysis is done in one of the following packages: R (S-Plus), Matlab, SAS, SPSS and Stata.
This does not mean that every one of those packages is good for every specific type of analysis. In fact, for most advanced areas, only 2-3 packages will be suitable, providing enough functionality or enough tools to implement this functionality easily.
For example, the very important area of Markov Chain Monte Carlo is doable in R, Matlab and SAS only, unless you want to rely on convoluted macros written by random users on the web.
Statistical Software
R & MATLAB
R and Matlab are the richest systems by far. They contain an impressive number of libraries, which is growing each day. Even if a desired, very specific model is not part of the standard functionality, you can implement it yourself, because R and Matlab are really programming languages with relatively simple syntaxes. As "languages" they allow you to express any idea. The question is whether you are a good writer or not. In terms of modern applied statistics tools, R libraries are somewhat richer than those of Matlab. Also, R is free. On the flip side, Matlab has much better graphics, which you will not be ashamed to put in a paper or a presentation.
Statistical Software
SPSS
On the other end of the spectrum is a package like SPSS. SPSS is quite narrow in its capabilities and allows you to do only about half of the mainstream statistics. It is quite useless for ambitious modeling and estimation procedures such as those used in kernel smoothing, pattern recognition or signal processing. Nonetheless, SPSS is very popular among practitioners because it requires almost no programming training. All you have to do is hit several buttons and SPSS does all the calculations for you. In those cases when you need something standard, SPSS may have it implemented fully. The SPSS output will be quite detailed and visually pleasing. It will contain all the major tests and diagnostic tools associated with the method and will allow you to write an informative statistics section of your empirical analysis. In short, when the method is there, it is faster to run than similar functionality in R or Matlab. So I use SPSS often for standard requests from my clients, like running linear regression, ANOVA or principal components analysis. SPSS gives you the ability to program macros, but that feature is quite inflexible.
Statistical Software
SAS & STATA
Somewhere in between R, Matlab and SPSS lie SAS and Stata. SAS offers more extensive analytics than Stata. It is composed of dozens of procedures with massive, massive output, often covering more than ten pages. The idea of SAS is not to listen to you that much. It is like an old grandfather whom you approach with a simple question, but instead he tells you the story of his life. Many procedures contain three times more than what you need to know about that segment. So some time has to be spent on filtering out the relevant output. SAS procedures are invoked using simple scripts. Stata procedures can be invoked by clicking buttons in the menu or by running simple scripts. In the menu part, Stata resembles SPSS. Both SAS and Stata are programming languages, so they allow you to build analytics around standard procedures. Stata is somewhat more flexible than SAS. Still, in terms of programming flexibility, Stata and SAS do not come even close to R or Matlab.
Selected strengths of SAS compared to all other packages: large data sets, speed, beautiful graphics, flexibility in formatting the output, time series procedures, counting processes.
Selected strengths of Stata compared to all other packages: manipulation of survey data (stratified samples, clustering), robust estimation and tests, longitudinal data methods, multivariate time series.
Statistical Software
Useful Resources:
• American Statistical Association
• Department of Statistics, Stanford University
• Elementary Statistics Books Available to Download for Free
Statistical Software• Downloading R• R Manuals (at CRAN)• Accessing the SCF Remotely (includes how to get the necessary software)• Class Bulletin Board (bspace)• Driver to convert Windows Documents to PDF• Introduction to R (pdf)• Slides for a Course in R (pdf)• R Graph Gallery• statsnetbase Search for R Graphics to read Paul Murrells book about plotting in R• Some Notes on Saving Plots in R• Free Graphical MySQL Client• SQLite Graphical Client for Windows• Instructions on running the Firefox SQLiteManager extension as an application on Mac OSX• Accessing the Class MySQL Server through an SSH Tunnel• Connecting to the MySQL server under Windows• Introduction to Cluster Analysis (statsoft.nl)• Fruit pictures for the "slot machine" (zipped)• R TclTk examples• More R TclTk examples• Additional GUI examples: Deal or No Deal Piano• HTML Form Tutorial• Setting up your account for CGI scripting• Running your own Webserver to test CGI programs (Mac & Linux)• Notes on Document Preparation with Latex• vi reference card• emacs reference card• R reference card• More information on Dates and Times in R• More information on Factors in R
Statistical SoftwareBooks • Competing on Analytics • Analytics at Work • Super Crunchers • The Numerati • Data Driven • Data Source Handbook • Programming Collective Intelligence • Mining the Social Web • Data Analysis with Open Source Tools • Visualizing Data • The Visual Display of Quantitative Information • Envisioning Information • Visual Explanations: Images and Quantities, Evidence and Narrative • Beautiful Evidence • Think Stats • Data Analysis Using Regression and Multilevel/Hierarchical Models • Applied Longitudinal Data Analysis • Design of Observational Studies • Statistical Rules of Thumb • All of Statistics • A Handbook of Statistical Analyses Using R • Mathematical Statistics and Data Analysis • The Elements of Statistical Learning • Counterfactuals and Causal Inference
Statistical Software
• Mining of Massive Data Sets
• Data Analysis: What Can Be Learned From the Past 50 Years
• Bias and Causation
• Regression Modeling Strategies
• Probably Not
• Statistics as Principled Argument
• The Practice of Data Analysis
Great class notes on Data Science: http://statistics.berkeley.edu/classes/s133/all2011.pdf
Related Workshops
• Data Bootcamp, Strata 2011
• Machine Learning Summer School, Purdue 2011
• Looking at Data
Statistical SoftwareCourses • Concepts in Computing with Data, Berkeley • Practical Machine Learning, Berkeley • Artificial Intelligence, Berkeley • Visualization, Berkeley • Data Mining and Analytics in Intelligent Business Services, Berkeley • Data Science and Analytics: Thought Leaders, Berkeley • Machine Learning, Stanford • Paradigms for Computing with Data, Stanford • Mining Massive Data Sets, Stanford • Data Visualization, Stanford • Algorithms for Massive Data Set Analysis, Stanford • Research Topics in Interactive Data Analysis, Stanford • Data Mining, Stanford • Machine Learning, CMU • Statistical Computing, CMU • Machine Learning with Large Datasets, CMU • Machine Learning, MIT • Data Mining, MIT • Statistical Learning Theory and Applications, MIT • Data Literacy, MIT • Introduction to Data Mining, UIUC • Learning from Data, Caltech • Introduction to Statistics, Harvard • Data-Intensive Information Processing Applications, University of Maryland
Statistical Software • Dealing with Massive Data, Columbia • Data-Driven Modeling, Columbia • Introduction to Data Mining and Analysis, Georgia Tech • Computational Data Analysis: Foundations of Machine Learning and Da..., Georgia Tech • Applied Statistical Computing, Iowa State • Data Visualization, Rice • Data Warehousing and Data Mining, NYU • Data Mining in Engineering, Toronto • Machine Learning and Data Mining, UC Irvine • Knowledge Discovery from Data, Cal Poly • Large Scale Learning, University of Chicago • Data Science: Large-scale Advanced Data Analysis, University of Florida • Strategies for Statistical Data Analysis, Universität LeipzigVideos • Lies, damned lies and statistics (about TEDTalks) • The Joy of Stats • Journalism in the Age of Data
Data Science Team Ideas
Keep it simple!
Work on a real problem from work.
Suggestions for more challenging problems:
Census Return Rate
Develop a statistical model to predict census mail return rates at the Census block group level of geography. The Census Bureau will use this model for planning purposes for the decennial census and for demographic sample surveys.
Develop and evaluate different statistical approaches to proposing the best predictive model for geographic units. The intent is to improve current predictive analytics.
Data Science Team Ideas
Hierarchical load forecasting problem: backcasting and forecasting hourly loads (in kW) for a US utility with 20 zones.
Backcast and forecast at both the zonal level (20 series) and the system level (the sum of the 20 zonal series), 21 series in total. The data history (loads of 20 zones and temperatures of 11 stations) ranges from the 1st hour of 2004/1/1 to the 6th hour of 2008/6/30. Given the actual temperature history, the 8 weeks below in the load history are set to be missing and must be backcast. It's OK to use the entire history to backcast these 8 weeks.
• 2005/3/6 - 2005/3/12
• 2005/6/20 - 2005/6/26
• 2005/9/10 - 2005/9/16
• 2005/12/25 - 2005/12/31
• 2006/2/13 - 2006/2/19
• 2006/5/25 - 2006/5/31
• 2006/8/2 - 2006/8/8
• 2006/11/22 - 2006/11/28
Need to forecast hourly loads from 2008/7/1 to 2008/7/7. No actual temperatures
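The hierarchy in this problem (20 zonal series plus one system series that is their sum) and the eight missing weeks can be set up as follows. This is a minimal sketch under assumptions: the container layout (a dict from zone number to a list of hourly loads), the key name `"system"`, and the toy load values are invented for illustration; only the week boundaries come from the problem statement.

```python
from datetime import date

# The eight weeks to backcast, as (first day, last day), both inclusive.
BACKCAST_WEEKS = [
    (date(2005, 3, 6), date(2005, 3, 12)),
    (date(2005, 6, 20), date(2005, 6, 26)),
    (date(2005, 9, 10), date(2005, 9, 16)),
    (date(2005, 12, 25), date(2005, 12, 31)),
    (date(2006, 2, 13), date(2006, 2, 19)),
    (date(2006, 5, 25), date(2006, 5, 31)),
    (date(2006, 8, 2), date(2006, 8, 8)),
    (date(2006, 11, 22), date(2006, 11, 28)),
]


def add_system_series(zonal_loads):
    """Return a copy of zonal_loads with a 'system' series: the hourly sum of all zones."""
    lengths = {len(series) for series in zonal_loads.values()}
    if len(lengths) != 1:
        raise ValueError("all zonal series must cover the same hours")
    all_series = dict(zonal_loads)
    all_series["system"] = [sum(hour) for hour in zip(*zonal_loads.values())]
    return all_series


def missing_hours(weeks):
    """Count the hourly values to backcast across inclusive date ranges."""
    return sum(((end - start).days + 1) * 24 for start, end in weeks)


# Toy loads (kW) for 20 zones over 3 hours; invented values, not the real data.
toy_zones = {zone: [float(zone)] * 3 for zone in range(1, 21)}
all_series = add_system_series(toy_zones)

print("Number of series:", len(all_series))
print("System load per hour:", all_series["system"])
print("Hourly values to backcast per series:", missing_hours(BACKCAST_WEEKS))
```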
Data Science Team Ideas
Wind power forecasting problem: predicting hourly power generation up to 48 hours ahead at 7 wind farms, based on historical measurements and additional wind forecast information (48-hour-ahead predictions of wind speed and direction at the sites). The data is available for a period ranging from the 1st hour of 2009/7/1 to the 12th hour of 2012/6/28.
The period between 2009/7/1 and 2010/12/31 is a model identification and training period, while the remainder of the dataset, that is, from 2011/1/1 to 2012/6/28, is there for the evaluation. The training period is to be used for designing and estimating models that predict wind power generation at lead times from 1 to 48 hours ahead, based on past power observations and/or available meteorological wind forecasts for that period. The evaluation part aims to mimic real operational conditions. For that, a number of 48-hour periods with missing power observations were defined. All these power observations are to be predicted. These periods are defined as follows. The first period with missing observations is that from 2011/1/1 at 01:00 until 2011/1/3 at 00:00. The second period with missing observations is that from 2011/1/4 at 13:00 until 2011/1/6 at 12:00. Note that, to be consistent, only the meteorological forecasts for that period that would actually be available in practice are given. These two periods then repeat every 7 days until the end of the dataset. In between periods with missing data, power observations are available for updating the models.
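The schedule of missing 48-hour periods described above can be generated rather than listed by hand. This is a minimal sketch: it takes the two stated first periods, repeats each every 7 days, and assumes (this is not stated in the problem) that a period is kept only if it ends on or before the last hour of the dataset. The function and constant names are invented for illustration.

```python
from datetime import datetime, timedelta

# Starts of the two first missing periods, as stated in the problem description.
FIRST_STARTS = [datetime(2011, 1, 1, 1), datetime(2011, 1, 4, 13)]
# A period holds 48 hourly timestamps, so start and end are 47 hours apart.
PERIOD_SPAN = timedelta(hours=47)
# Last hour of the dataset: the 12th hour of 2012/6/28.
DATASET_END = datetime(2012, 6, 28, 12)


def missing_periods():
    """Return the (start, end) pairs of all missing periods, in time order."""
    periods = []
    for first_start in FIRST_STARTS:
        start = first_start
        # Assumption: keep a period only if it lies fully inside the dataset.
        while start + PERIOD_SPAN <= DATASET_END:
            periods.append((start, start + PERIOD_SPAN))
            start += timedelta(days=7)
    return sorted(periods)


periods = missing_periods()
print("Number of missing periods:", len(periods))
print("First period:", periods[0])
print("Last period: ", periods[-1])
```

Each `(start, end)` pair marks the hours whose power observations must be predicted; observations between consecutive pairs remain available for updating the model.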
Data Science Team Ideas
Predict the online sales of a consumer product based on a data set of product features.
Build as good a model as possible to predict monthly online sales of a product. Imagine the products are online self-help programs following an initial advertising campaign.
Obtain data in the comma-separated values (CSV) format. Each row in this data set represents a different consumer product.
The first 12 columns (Outcome_M1 through Outcome_M12) contain the monthly online sales for the first 12 months after the product launches.
Date_1 is the day number the major advertising campaign began and the product launched.
Date_2 is the day number the product was announced and a pre-release advertising campaign began.
Other columns in the data set are features of the product and the advertising campaign. Quan_x are quantitative variables and Cat_x are categorical variables. Binary categorical variables are measured as (1) if the product had the feature and (0) if it did not.
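A first step with a file laid out this way is to separate the outcome, date, quantitative and categorical columns by their name prefixes. The sketch below is a hedged illustration: the column prefixes come from the description above, but the sample row, the function names and the derived "lead time" feature (Date_1 minus Date_2) are invented for the example.

```python
import csv
import io


def split_columns(fieldnames):
    """Group column names by the prefixes used in the data set description."""
    groups = {"outcome": [], "date": [], "quantitative": [], "categorical": [], "other": []}
    for name in fieldnames:
        if name.startswith("Outcome_M"):
            groups["outcome"].append(name)
        elif name.startswith("Date_"):
            groups["date"].append(name)
        elif name.startswith("Quan_"):
            groups["quantitative"].append(name)
        elif name.startswith("Cat_"):
            groups["categorical"].append(name)
        else:
            groups["other"].append(name)
    return groups


def load_products(csv_text):
    """Parse CSV text into (column names, list of row dicts); one row per product."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return reader.fieldnames, rows


# Invented two-month sample with one product; the real file has 12 outcome columns.
SAMPLE_CSV = "Outcome_M1,Outcome_M2,Date_1,Date_2,Quan_1,Cat_1\n100,80,30,10,2.5,1\n"

fieldnames, rows = load_products(SAMPLE_CSV)
groups = split_columns(fieldnames)

# Hypothetical derived feature: days between announcement (Date_2) and launch (Date_1).
lead_time_days = int(rows[0]["Date_1"]) - int(rows[0]["Date_2"])

print(groups)
print("Lead time in days:", lead_time_days)
```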
Data Science Team Ideas
Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
The goal is to build a model that borrowers can use to help make the best financial decisions. Obtain historical data on 250,000 borrowers.
Data Tools
Included is a list of tools, such as programming languages and web-based utilities, data mining resources, some prominent organizations in the field, repositories where you can play with data, events you may want to attend and important articles you should take a look at.
The second segment of the list includes a number of art and design resources that infographic designers might like, including color palette generators and image searches. There are also some invisible web resources (if you're looking for something on Google and not finding it) and metadata resources so you can appropriately curate your data.
Data Tools
Google Refine – A power tool for working with messy data (formerly Freebase Gridworks).
The Overview Project – Overview is an open-source tool to help journalists find stories in large amounts of data by cleaning, visualizing and interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read.
Refine, reuse and request data | ScraperWiki – ScraperWiki is an online tool to make acquiring useful data simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code.
Data Curation Profiles – This website is an environment where academic librarians of all kinds, special librarians at research facilities, archivists involved in the preservation of digital data, and those who support digital repositories can find help, support and camaraderie in exploring avenues to learn more about working with research data and the use of the Data Curation Profiles Tool.
Google Chart Tools – Google Chart Tools provide a way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides a large number of well-designed chart types. Populating your data is easy using the provided client- and server-side tools.
22 free tools for data visualization and analysis
The R Journal – The R Journal is the refereed journal of the R project for statistical computing. It features short to medium length articles covering topics that might be of interest to users or developers of R.
CS 229: Machine Learning – A widely referenced course by Professor Andrew Ng, CS 229: Machine Learning provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
Google Research Publication: BigTable – Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
Scientific Data Management – An introduction.
Natural Language Toolkit – Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
Beautiful Soup – Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.
Mondrian: Pentaho Analysis – Pentaho open source analysis OLAP server written in Java. Enabling interactive analysis of very large datasets stored
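Screen-scraping, as mentioned for ScraperWiki and Beautiful Soup above, comes down to pulling structured values out of HTML. The sketch below illustrates the idea using only Python's standard-library html.parser rather than Beautiful Soup's own API; the LinkCollector class, the extract_links helper and the sample page are hypothetical names and data invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, text) pair found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page fragment standing in for a downloaded document.
SAMPLE_PAGE = """
<ul>
  <li><a href="/reports/2011.csv">2011 report</a></li>
  <li><a href="/reports/2012.csv">2012 <b>report</b></a></li>
</ul>
"""

links = extract_links(SAMPLE_PAGE)
for href, text in links:
    print(href, "->", text)
```

For messy real-world pages, a dedicated parser such as Beautiful Soup is usually the more forgiving choice; the standard-library version is shown only because it runs without extra installs.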
Data Tools
The Comprehensive R Archive Network – R is ‘GNU S’, a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information. CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.
DataStax – Software, support, and training for Apache Cassandra.
Machine Learning Demos
Visual.ly – Infographics & Visualizations. Create, Share, Explore.
Google Fusion Tables – Google Fusion Tables is a modern data management and publishing web application that makes it easy to host, manage, collaborate on, visualize, and publish data tables online.
Tableau Software – Fast Analytics and Rapid-fire Business Intelligence from Tableau Software.
WaveMaker – WaveMaker is a rapid application development environment for building, maintaining and modernizing business-critical Web 2.0 applications.
Visualization: Annotated Time Line – Google Chart Tools – Google Code – An interactive time series line chart with optional annotations. The chart is rendered within the browser using Flash.
Visualization: Motion Chart – Google Chart Tools – Google Code – A dynamic chart to explore several indicators over time. The chart is rendered within the browser using Flash.
PhotoStats – Create gorgeous infographics about your iPhone photos.
Ionz – Ionz will help you craft an infographic about yourself.
chart builder – Powerful tools for creating a variety of charts for online display.
Creately – Online diagramming and design.
Pixlr Editor – A powerful online photo editor.
Google Public Data Explorer – The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate. As the charts and maps animate over time, the changes in the world become easier to understand. You don’t have to be a data expert to navigate between different views, make your own comparisons, and share your findings.
Fathom – Fathom Information Design helps clients understand and express complex data through information graphics, interactive tools, and software for installations, the web, and mobile devices. Led by Ben Fry. Enough said!
healthymagination | GE Data Visualization – Visualizations that advance the conversation about issues that shape our lives; visitors are encouraged to download, post and share them.
ggplot2 – ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
Data Tools
Data Mining
1. Weka – Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.
2. PSPP – PSPP is a program for statistical analysis of sampled data. It is a free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions. The most important of these exceptions is that there are no “time bombs”: your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package. PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.
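To make "descriptive statistics" and "T-tests" concrete, here is a minimal sketch of both using only Python's standard library. It is not how PSPP or Weka compute them internally; the describe and welch_t helpers and the two samples are hypothetical, and the t statistic shown is Welch's unequal-variance form, one of several variants.

```python
import math
import statistics


def describe(sample):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation
        "min": min(sample),
        "max": max(sample),
    }


def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    mean_diff = statistics.mean(a) - statistics.mean(b)
    std_err = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return mean_diff / std_err


# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.5, 4.3, 4.9, 4.7]

summary_a = describe(group_a)
t_stat = welch_t(group_a, group_b)
print(summary_a)
print("t statistic:", round(t_stat, 3))
```

Turning the statistic into a p-value additionally requires the t distribution, which a statistics package such as PSPP or R provides.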
Data Tools
3. Rapid-I – Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale base, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization. The main product of Rapid-I, the data analysis solution RapidMiner, is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into your own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies such as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.
Data Tools
4. R Project – R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
Data Tools
Organizations
1. Data.gov
2. SDM group at LBNL
3. Open Archives Initiative
4. Code for America | A New Kind of Public Service
5. The # DataViz Daily
6. Institute for Advanced Analytics | North Carolina State University | Professor Michael Rappa · MSA Curriculum
7. BuzzData | Blog, 25 great links for data-lovin’ journalists
8. MetaOptimize – Home – Machine learning, natural language processing, predictive analytics, business intelligence, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
9. had.co.nz
10. Measuring Measures – Measuring Measures
Data Tools
Repositories
1. Repositories | DataCite
2. Data | The World Bank
3. Infochimps Data Marketplace + Commons: Download Sell or Share Databases, statistics, datasets for free | Infochimps
4. Factual Home – Factual
5. Flowing Media: Your Data Has Something To Say
6. Chartsbin
7. Public Data Explorer
8. StatPlanet
9. ManyEyes
10. 25+ more ways to bring data into R
Data Tools
Articles
1. Data Science: a literature review | (R news & tutorials)
2. What is “Data Science” Anyway?
3. Hal Varian on how the Web challenges managers – McKinsey Quarterly – Strategy – Innovation
4. The Three Sexy Skills of Data Geeks « Dataspora
5. Rise of the Data Scientist
6. dataists » A Taxonomy of Data Science
7. The Data Science Venn Diagram « Zero Intelligence Agents
8. Revolutions: Growth in data-related jobs
9. Building data startups: Fast, big, and focused – O’Reilly Radar
Data Tools
Art Design
1. Periodic Table of Typefaces
2. Color Scheme Designer 3
3. Color Palette Generator – Generate A Color Palette For Any Image
4. COLOURlovers
5. Colorbrewer: Color Advice for Maps
Data Tools
Image Searches
1. American Memory from the Library of Congress – The home page for the American Memory Historical Collections from the Library of Congress. American Memory provides free access to historical images, maps, sound recordings, and motion pictures that document the American experience. American Memory offers primary source materials that chronicle historical events, people, places, and ideas that continue to shape America.
2. Galaxy of Images | Smithsonian Institution Libraries
3. Flickr Search
4. 50 Websites For Free Vector Images Download – Design weblog for designers, bloggers and tech users. Covering useful tools, tutorials, tips and inspirational photos.
5. Images – Google Images. The most comprehensive image search on the web.
6. Trade Literature – a set on Flickr
7. Compfight / A Flickr Search Tool
8. morgueFile free photos for creatives by creatives
9. stock.xchng – the leading free stock photography site
10. The Ultimate Collection Of Free Vector Packs – Smashing Magazine
11. How to Create Animated GIFs Using Photoshop CS3 – wikiHow
12. IAN Symbol Libraries (Free Vector Symbols and Icons) – Integration and Application Network
13. Usability.gov
14. best icons
15. Iconspedia
16. IconFinder
17. IconSeeker
Data Tools
Invisible Web
1. 10 Search Engines to Explore the Invisible Web
2. Scirus – for scientific information – The most comprehensive scientific research tool on the web. With over 410 million scientific items indexed at last count, it allows researchers to search for not only journal content but also scientists’ homepages, courseware, pre-print server material, patents and institutional repository and website information.
3. TechXtra: Engineering, Mathematics, and Computing – TechXtra is a free service which can help you find articles, books, the best websites, the latest industry news, job announcements, technical reports, technical data, full text eprints, the latest research, theses and dissertations, teaching and learning resources and more, in engineering, mathematics and computing.
4. Welcome to INFOMINE: Scholarly Internet Resource Collections – INFOMINE is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level. It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
5. The WWW Virtual Library – The WWW Virtual Library (VL) is the oldest catalogue of the Web, started by Tim Berners-Lee, the creator of HTML and of the Web itself, in 1991 at CERN in Geneva. Unlike commercial catalogues, it is run by a loose confederation of volunteers, who compile pages of key links for particular areas in which they are expert; even though it isn’t the biggest index of the Web, the VL pages are widely recognised as being amongst the highest-quality guides to particular sections of the Web.
6. Intute – Intute is a free online service that helps you to find web resources for your studies and research. With millions of resources available on the Internet, it can be difficult to find useful material. We have reviewed and evaluated thousands of resources to help you choose key websites in your subject.
7. CompletePlanet – Discover over 70,000+ databases and specialty search engines – There are hundreds of thousands of databases that contain Deep Web content. CompletePlanet is the front door to these Deep Web databases on the Web and to the thousands of regular search engines; it is the first step in trying to find highly topical information. By tracing through
8. Infoplease: Encyclopedia, Almanac, Atlas, Biographies, Dictionary, Thesaurus – Information Please has been providing authoritative answers to all kinds of factual questions since 1938: first as a popular radio quiz show, then starting in 1947 as an annual almanac, and since 1998 on the Internet at www.infoplease.com. Many things have changed since 1938, but not our dedication to providing reliable information, in a way that engages and entertains.
9. DeepPeep: discover the hidden web – DeepPeep is a search engine specialized in Web forms. The current beta version tracks 45,000 forms across 7 domains. DeepPeep helps you discover the entry points to content in Deep Web (aka Hidden Web) sites, including online databases and Web services. Advanced search allows you to perform more specific queries. Besides specifying keywords, you can also search for specific form element labels, i.e., the description of the form attributes.
10. IncyWincy: The Invisible Web Search Engine – IncyWincy is a showcase of Net Research Server (NRS) 5.0, a software product that provides a complete search portal solution, developed by LoopIP LLC. LoopIP licenses the NRS engine and provides consulting expertise in building search solutions.
Data Tools
Metadata
Description Schema: MODS (Library of Congress) and Outline of elements and attributes in MODS version 3.4: MetadataObject – This document contains a listing of elements and their related attributes in MODS Version 3.4 with values or value sources where applicable. It is an “outline” of the schema. Items highlighted in red indicate changes made to MODS in Version 3.4.
All top-level elements and all attributes are optional, but you must have at least one element. Subelements are optional, although in some cases you may not have empty containers. Attributes are not in a mandated sequence and not repeatable (per XML rules). “Ordered” below means the subelements must occur in the order given. Elements are repeatable unless otherwise noted.
“Authority” attributes are either followed by codes for authority lists (e.g., iso639-2b) or “see” references that link to documents that contain codes for identifying authority lists.
For additional information about any MODS elements (version 3.4 elements will be added soon), please see the MODS User Guidelines.
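As a rough illustration of the outline above, a small MODS record can be assembled with Python's standard-library xml.etree.ElementTree. This is a sketch, not a validated record: the build_mods_record helper and the sample title are hypothetical, and only two top-level elements (titleInfo and language) are shown, with the language term using the iso639-2b authority mentioned in the text.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace published by the Library of Congress.
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def build_mods_record(title, language_code):
    """Build a minimal MODS element tree with a title and a language code."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")

    # All top-level elements are optional, but at least one must be present.
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title

    # The authority attribute names the code list the value is drawn from.
    language = ET.SubElement(mods, f"{{{MODS_NS}}}language")
    term = ET.SubElement(
        language,
        f"{{{MODS_NS}}}languageTerm",
        {"type": "code", "authority": "iso639-2b"},
    )
    term.text = language_code
    return mods


record = build_mods_record("Data Science Overview", "eng")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Checking a record against the official MODS schema needs a schema-aware validator, which the standard library does not include.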
Data Tools
wiki.dbpedia.org : About – DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.
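The "sophisticated queries" DBpedia supports are written in SPARQL. The sketch below only builds a request URL and does not contact any server. It assumes the public endpoint lives at https://dbpedia.org/sparql and accepts a format parameter alongside the standard query parameter; both are assumptions to check against DBpedia's current documentation. The build_query_url helper is a hypothetical name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of the public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Ask for five programming languages and their English labels.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?language ?label WHERE {
  ?language a dbo:ProgrammingLanguage ;
            rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
""".strip()


def build_query_url(sparql, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL carrying the SPARQL query as a URL-encoded parameter."""
    params = {
        "query": sparql,
        # Assumed endpoint-specific parameter requesting JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_query_url(QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL can be opened with any HTTP client; the response, if the endpoint behaves as assumed, would be a JSON document of variable bindings.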
Data Tools
Semantic Web – W3C – In addition to the classic “Web of documents” W3C is helping to build a technology stack to support a “Web of data,” the sort of data you find in databases. The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.
RDA: Resource Description & Access | www.rdatoolkit.org – Designed for the digital world and an expanding universe of metadata users, RDA: Resource Description and Access is the new, unified cataloging standard. The online RDA Toolkit subscription is the most effective way to interact with the new standard. More on RDA.
Data Tools
Cataloging Cultural Objects – A Guide to Describing Cultural Works and Their Images (CCO) is a manual for describing, documenting, and cataloging cultural works and their visual surrogates. The primary focus of CCO is art and architecture, including but not limited to paintings, sculpture, prints, manuscripts, photographs, built works, installations, and other visual media. CCO also covers many other types of cultural works, including archaeological sites, artifacts, and functional objects from the realm of material culture.
Library of Congress Authorities (Search for Name, Subject, Title and Name/Title) – Using Library of Congress Authorities, you can browse and view authority headings for Subject, Name, Title and Name/Title combinations; and download authority records in MARC format for use in a local library system. This service is offered free of charge.
Search Tools and Databases (Getty Research Institute) – Use these search tools to access library materials, specialized databases, and other digital resources.
Data Tools
Art & Architecture Thesaurus (Getty Research Institute) – Learn about the purpose, scope and structure of the AAT. The AAT is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the AAT’s contributors.
Getty Thesaurus of Geographic Names (Getty Research Institute) – Learn about the purpose, scope and structure of the TGN. The TGN is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the TGN’s contributors.
DCMI Metadata Terms
The Digital Object Identifier System
The Federal Geographic Data Committee
9 mistakes that will kill the best data analyses
1. Sampling or design of experiment not properly done
2. Non-robust cross-validation
3. Poor communication of results to management or clients
4. Poor data visualization
5. Does not solve our business problems
6. Database misses important data or fields
7. Failure to leverage external data
8. Can't make business data silos "talk to each other"
9. Developers (production people) and designers speak "different languages"
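On mistake 2 in the list above, one common way to make cross-validation more robust is k-fold splitting: every observation is held out for testing exactly once, and no observation is ever in the training and test set of the same split. The sketch below shows only the index bookkeeping, in plain Python; the k_fold_indices function and its seed parameter are hypothetical names, and model fitting and scoring are left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering range(n_samples)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once, with a fixed seed, so the folds are reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]

    splits = []
    for held_out, test_idx in enumerate(folds):
        train_idx = [
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        ]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(n_samples=10, k=5)
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

A model would be fitted on each training set and scored on the matching test set, with the k scores averaged. Data with time ordering or grouped observations generally needs a different splitting scheme.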
Thank You
Presentation by: Michael Walker
Rose Business Technologies
720.firstname.lastname@example.org://www.rosebt.com