Data science technology overview

Data Science Technology Overview
SOOJUNG HONG (DEC 3. 2015)
Main Reference :
1. Big Data : The next frontier for innovation, competition, and productivity (McKinsey Global Institute Report)
2. Big Data at Work (Harvard Business Review Press)
3. And much other machine learning literature

Data Analytics Application Spectrum
Data (Big Data Technology)
Model and Algorithm (Machine Learning Technology)
Analytic Insight (Visualization Technology)
Source : Analytics Driven Organization : Atos white paper

Data Science Innovation Funnel
 Define the different parts of the innovation funnel
Part 1 : Data research & Hypothesis building
 Data Science
Part 2 : ML solution building & implementation
 ML Engineering
Part 3 : A/B Testing (if Web app) and Analysis
 Data Science
Source : http://www.slideshare.net/xamat/10-more-lessons-learned-from-building-machine-learning-systems
Data Analytics Applications

Right Technology
Each analytics application typically differs in data inputs
 Technologies required for processing can vary significantly
Example A
 Task : Storing and dynamically analysing
 Data Type : Vast amount of unstructured data
 Source : Web and social media
Example B
 Task : Capture and analysis
 Data Type : High volume & velocity
 Source : Telecommunication network signals or IT infrastructure system logs

Part 1 : Big Data Technology
1. Data Source
 Multiple sources : image, sound, context
 Internal data sources (financial performance data, ERP, PLM, CRM, GPS data etc.)
 External data sources (like social media, video, voice and plain text, biz-sector studies)
(ex) Data Example : Stream processing. Large real-time streams of event data for algorithmic trading in financial services,
RFID event processing applications, fraud detection, process monitoring, Location-based services in telco
2. Scalability
 Scale up
 Scale out
* 150 exabytes of data in healthcare (1 exabyte = 2^60 = 1,152,921,504,606,846,976 bytes in the binary convention)
* The New York Stock Exchange captures 1 TB of trade information during each trading session
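
To make the scale figures concrete, the arithmetic can be checked in a few lines. This is only a rough sketch: the 4 TB disk size is an assumption chosen for illustration, not a figure from the references.

```python
# Binary-convention exabyte, matching the byte count quoted on the slide.
EXABYTE = 2 ** 60
TERABYTE = 2 ** 40

healthcare_bytes = 150 * EXABYTE

# Hypothetical scale-out sizing: how many 4 TB disks would hold one copy?
DISK_BYTES = 4 * TERABYTE                           # assumed disk size
disks_needed = -(-healthcare_bytes // DISK_BYTES)   # ceiling division

print(f"{EXABYTE:,} bytes per exabyte")
print(f"{disks_needed:,} disks of 4 TB for 150 exabytes")
```

A count in the tens of millions of disks is one way to see why scale out (adding machines) matters more here than scale up (a bigger machine).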

Big Data Technology (1) : How to store data
 Bigtable
 Proprietary distributed database system built on the Google File System (GFS)
 Part of the inspiration for Hadoop
 Google Cloud Bigtable : public version of Bigtable (May 2015)
 Hadoop (MapReduce implementation)
 Open source framework for distributed storage and processing of large data sets across multiple computers
 Commercial Hadoop distributions (Cloudera, Hortonworks, EMC, Microsoft)
 Cloud
 Public, hybrid and private cloud
 S3 (Amazon Simple Storage Service), Microsoft Azure Storage
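
Since Hadoop is described above as a MapReduce implementation, a single-process sketch of the MapReduce idea may help. It is not Hadoop code: the function names and the two sample documents are made up, and a real cluster would run the map and reduce phases on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit one (word, 1) pair per word, as a mapper would."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Each string stands in for an input split stored on a different machine.
documents = ["big data big insight", "data beats algorithm"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each mapper only sees its own split and each reducer only sees one key's values, the same logic can be spread across as many machines as there are splits and keys.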

Big Data Technology (2) : How to process data
 In-memory distributed computing environments
 Store, distribute and compute large amounts of data
 Spark, H2O or SAP HANA
 Data warehouse
 Specialized database optimized for reporting, often used for storing large amounts of structured data.
 Data is uploaded using ETL (extract, transform, and load) tools; reports are often generated using BI tools.
 Data mart
 Subset of a data warehouse, used to provide data to users, usually through business intelligence tools.
 Dynamo
 Proprietary distributed data storage system developed by Amazon.
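
The ETL step mentioned for the data warehouse can be sketched with the Python standard library. This is a toy illustration under stated assumptions: the CSV export, the `sales` table and its columns are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (names are illustrative).
RAW_CSV = """order_id,region,amount
1,north, 120.50
2,south,80
3,north,not_available
"""

def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalise types and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), row["region"].strip().upper(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# A BI-style report: total sales per region.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(report)
```

The final query plays the role of a BI report: it reads only the cleaned, structured table and never touches the raw export.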

Big Data Technology (3) : How to manage data
 HBase
 An open source (free), distributed, non-relational database modeled on Google’s Bigtable
 Column-oriented database management system that runs on top of HDFS
 Hive
 Data warehouse infrastructure built on top of Hadoop
 Provides data summarization, query, and analysis (i.e. SQL-like queries on top of Hadoop)
 Cassandra
 Open source database management system designed to handle huge amounts of structured data
 Distributed, continuously available, with linearly scalable performance
 Originally developed at Facebook, now managed as a project of the Apache Software Foundation

Coexistence Strategy
 The majority of large companies will not replace their existing data-warehouse-based systems with a big data solution.
 Coexistence strategy
 The data warehouse can continue with its standard workload, using data from legacy operational systems

Big Data Technology : Data Format
Non-Relational database (NoSQL Database)
 In contrast to a relational database (which is based on tables, records, columns and rows)
 A database that does not store data in tables (rows and columns)
 Semi-structured data (NoSQL)
 Data do not conform to fixed fields but contain tags and other markers to separate data elements
 CSV, XML, HTML-tagged text and JSON documents are semi-structured documents
 Unstructured data
 Data do not conform to fixed fields
 Free-form text (e.g., books, articles, body of e-mail messages)
 Untagged audio, image and video data
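
The difference between fixed fields and tagged, semi-structured records can be shown briefly. The records below are invented for illustration; the point is only that JSON keys act as the "tags and other markers" while the set of fields varies from record to record.

```python
import json

# Structured: every record has the same fixed fields (like table rows).
structured_rows = [
    ("u1", "Alice", 34),
    ("u2", "Bob", 29),
]

# Semi-structured: tags (keys) mark the elements, but fields vary per record.
semi_structured = json.loads("""
[
  {"id": "u1", "name": "Alice", "tags": ["sports", "music"]},
  {"id": "u2", "name": "Bob", "address": {"city": "Seoul"}}
]
""")

# Fields must be looked up defensively because no fixed schema is enforced.
cities = [doc.get("address", {}).get("city") for doc in semi_structured]
all_keys = sorted({key for doc in semi_structured for key in doc})

print(cities)
print(all_keys)
```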

Part 2 : Machine Learning Technology
Algorithms : TECHNIQUES FOR ANALYZING BIG DATA
 All of the techniques we list here can be applied to Big Data
 In general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones
Lessons Learned : More data beats a clever algorithm!
- ‘A Few Useful Things to Know about Machine Learning’ by P. Domingos

Machine Learning Technology @Quora
 Millions of questions and answers
 Millions of users
 Thousands of Topics

Part 2 : (1) Machine Learning Technology
 Supervised learning
 A set of ML techniques (Decision Trees, k-NN, SVM, Naive Bayes, Neural Networks, Logistic Regression)
 Infers a function or relationship from a set of training data
 Examples : classification and regression
 Unsupervised learning
 A set of ML techniques (K-means, mixture models, hierarchical clustering)
 Finds hidden structure in unlabeled data.
 Example : Cluster analysis
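
As a small illustration of finding structure in unlabeled data, here is a pure-Python sketch of K-means on one-dimensional points. The data and the starting centroids are made up, and a library such as scikit-learn would normally be used instead.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data: assign points, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled, made-up data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge only from the distances between points.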

 Association rule learning (Apriori Algorithm, Eclat algorithm)
 A set of techniques for discovering interesting relationships (“association rules”) among variables
 Can be applied in large databases (data warehouse)
 Application example : Market basket analysis (Walmart’s Diaper and Beer)
 Classification (Decision Tree, SVM)
 A set of techniques to identify the categories in which new data points belong, based on a training set
 Training set contains data points that have already been categorized.
 Application example : prediction of segment-specific customer behavior (e.g. churn rate)
 Cluster analysis (K-means)
 A statistical method for classifying objects that splits a diverse group into smaller groups of similar objects
 Characteristics of similarity are not known in advance
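
For the association rule learning entry above, the two basic measures behind a rule such as "diapers → beer" are support and confidence. The baskets below are invented, and the sketch only computes the first level of an Apriori-style search rather than the full algorithm.

```python
from itertools import combinations

# Hypothetical shopping baskets, for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent) / support(antecedent)

# Frequent pairs: the first pruning level an Apriori-style search would keep.
MIN_SUPPORT = 0.4
items = sorted(set().union(*baskets))
frequent_pairs = [
    pair for pair in combinations(items, 2) if support(set(pair)) >= MIN_SUPPORT
]

print(frequent_pairs)
print(confidence({"diapers"}, {"beer"}))
```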

 Data fusion and data integration
 A set of techniques that integrate and analyze data from multiple sources
 Develop insights more efficiently and potentially more accurately than from a single source of data
 Example : Data from social media, analyzed by natural language processing, combined with real-time sales data
 Determine what effect a marketing campaign is having on customer sentiment and purchasing behavior
 Ensemble learning (Bayes Optimal Classifier, Bagging, Boosting)
 Using multiple predictive models (each developed using statistics and/or machine learning)
 Obtain better predictive performance than could be obtained from any of the constituent models
 A type of supervised learning.
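
The simplest way to combine several models is a majority vote. The sketch below uses three hand-written, hypothetical rules in place of trained models; it shows the combining step only and is not bagging or boosting.

```python
from collections import Counter

# Three deliberately weak, hypothetical rules for flagging an at-risk customer.
def rule_low_usage(c):
    return c["monthly_logins"] < 3

def rule_complaints(c):
    return c["complaints"] >= 2

def rule_short_tenure(c):
    return c["tenure_months"] < 6

MODELS = [rule_low_usage, rule_complaints, rule_short_tenure]

def ensemble_predict(customer):
    """Majority vote over the constituent models."""
    votes = Counter(model(customer) for model in MODELS)
    return votes.most_common(1)[0][0]

customers = [
    {"monthly_logins": 1, "complaints": 3, "tenure_months": 24},
    {"monthly_logins": 10, "complaints": 0, "tenure_months": 2},
]
predictions = [ensemble_predict(c) for c in customers]
print(predictions)
```

An odd number of voters avoids ties; bagging and boosting differ mainly in how the individual models are trained before a vote like this one.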

 Genetic algorithms
 A technique used for optimization that is inspired by the process of natural evolution
 Well-suited for solving nonlinear problems
 Examples : Improving job scheduling in manufacturing
Optimizing the performance of an investment portfolio.
 Neural networks (Both Supervised and Unsupervised Learning)
 Computational models, inspired by the structure and workings of biological neural networks
 Well-suited for finding nonlinear patterns (pattern recognition and optimization)
 Examples : Identifying high-value customers that are at risk of leaving a particular company
Identifying fraudulent insurance claims.
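
The genetic algorithm entry above can be illustrated with a toy problem. This is a minimal sketch, assuming a bit-string encoding and the "OneMax" objective (count the 1-bits); the population size, mutation rate and seed are arbitrary choices, not recommendations.

```python
import random

def fitness(bits):
    """Toy objective (OneMax): count the 1-bits; higher is better."""
    return sum(bits)

def evolve(pop_size=20, length=16, generations=40, seed=0):
    rng = random.Random(seed)  # seeded so the run is repeatable
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)         # one-point crossover
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.2:                 # occasional mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = parents + children            # elitism: parents survive
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because the fitter half always survives unchanged, the best fitness can never get worse from one generation to the next.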

 Network analysis
 Characterize relationships among discrete nodes in a graph or a network
 In social network analysis, analyze connections between individuals in a community or organization
 Examples : Identifying key opinion leaders to target for marketing
Identifying bottlenecks in enterprise information flows
 Optimization
 Numerical techniques to redesign complex systems and processes to improve their performance
 Objective measures can be cost, speed, or reliability
 Genetic algorithms can be used for optimization
 Examples : Improving operational processes such as scheduling and routing
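
For the network analysis entry above, one simple way to look for key opinion leaders is degree centrality. The names and edges below are hypothetical, and degree is only one of several centrality measures.

```python
from collections import defaultdict

# Hypothetical "who talks to whom" edges in a small community.
edges = [
    ("ana", "ben"), ("ana", "cho"), ("ana", "dev"),
    ("ben", "cho"), ("dev", "eli"),
]

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Degree centrality: share of the other nodes each node is directly tied to.
n = len(neighbors)
centrality = {node: len(adj) / (n - 1) for node, adj in neighbors.items()}
ranked = sorted(centrality, key=lambda node: (-centrality[node], node))
print(ranked[0], centrality[ranked[0]])
```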

 Regression
 Techniques to determine how the value of the dependent variable changes when one or more independent variables are modified.
 Often used for forecasting or prediction
 Examples : Forecasting sales volumes based on various market and economic variables
Determining what manufacturing parameters most influence customer satisfaction
 Sentiment analysis (Opinion Mining)
 Application of natural language processing and other analytic techniques
 Identify and extract subjective information from source text material
 Key aspects include identifying the feature, aspect, or product about which a sentiment is being expressed
 Determining the type, “polarity” (i.e., positive, negative, or neutral), and the degree and strength of the sentiment
 Examples : Companies applying sentiment analysis to social media to see how people are responding to them
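
For the regression entry above, the simplest case is a straight line fitted by ordinary least squares. The spend and sales figures are made up and lie exactly on a line so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data: advertising spend (x) against sales volume (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]   # exactly y = 1 + 2x

intercept, slope = fit_line(spend, sales)
forecast = intercept + slope * 6.0   # forecast for a spend level not yet observed
print(intercept, slope, forecast)
```

The slope answers the question posed above: how much the dependent variable changes when the independent variable changes by one unit.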

Example : Sentiment Classification Technique
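
One basic sentiment classification technique is a lexicon lookup. The sketch below is a simplification under stated assumptions: the word lists are tiny and hand-made, and negation is handled only for the word immediately after a negation word.

```python
import re

# Tiny hand-made lexicon; real systems use far larger, curated resources.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Return (label, score); a negation word flips the word right after it."""
    score, flip = 0, False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            flip = True
            continue
        value = (token in POSITIVE) - (token in NEGATIVE)
        score += -value if flip else value
        flip = False
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, score

print(polarity("I love the camera, great battery"))
print(polarity("Not good, the screen is terrible"))
print(polarity("It arrived on Tuesday"))
```

The label gives the polarity and the absolute score is a crude stand-in for strength; machine learning classifiers trained on labeled text are the usual alternative to a fixed lexicon.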

 Spatial analysis (Multilayer Perceptron, Mixture of Experts, Support Vector Regression)
 Using location information
 Example : Manufacturing supply chain network performance analysis
 Simulation
 Modeling the behavior of complex systems used for forecasting, predicting and scenario planning
 Repeated random sampling, i.e., running thousands of simulations, each based on different assumptions
 Example : Assessing the likelihood of meeting financial targets given uncertainties
 Time series analysis (Mixture of Experts model)
 Statistical and signal-processing techniques for analyzing sequences of data points (values) at successive times
 Extract meaningful characteristics from the data.
 Examples : Hourly value of a stock market index
Predicting the number of people who will be diagnosed with an infectious disease.
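
The simulation entry above (repeated random sampling to assess the likelihood of meeting a financial target) can be sketched as follows. The distributions for units sold and price, and the target itself, are assumptions chosen purely for illustration.

```python
import random

def probability_of_hitting_target(target, runs=10_000, seed=42):
    """Share of simulated years in which revenue meets the target."""
    rng = random.Random(seed)  # seeded so repeated calls agree
    hits = 0
    for _ in range(runs):
        # Assumed uncertainties, chosen only for illustration.
        units = rng.gauss(1000, 150)      # units sold
        price = rng.uniform(9.0, 11.0)    # average selling price
        if units * price >= target:
            hits += 1
    return hits / runs

p = probability_of_hitting_target(target=10_000)
print(f"Estimated probability of meeting the target: {p:.2f}")
```

Each run draws one possible year; the fraction of runs that reach the target is the estimate, and more runs give a steadier estimate.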

Machine Learning Infrastructure
Options : Experimentation vs Production
 Good intermediate option
1. ML researchers experiment in IPython Notebooks using Python tools (scikit-learn, Theano)
2. Implement the optimized version
3. Implement an abstraction layer on top of the optimized implementation so that it can be accessed from regular experimentation tools
 Must be aware of in production
1. Value of a model = value it brings to the product
2. Bridge the gap between the production design and the ML algorithm

ML Programming language & Tools
 R
 RStudio
 Popular packages : dplyr, plyr, data.table, stringr, zoo, ggvis, caret (for machine learning)
 Python
 IPython Notebook, Spyder
 Popular packages : pandas, SciPy, NumPy, scikit-learn (sklearn)
 scikit-learn meets several criteria, such as speed and well-designed classes for handling data, models, and results
R vs. Python : http://blog.datacamp.com/r-or-python-for-data-analysis

Example : R library for Machine Learning
 Analyzing Data : mean, var, sd, min, max, sum, rowSums, colSums
 Prediction : predict
 Apriori : arules package, apriori
 Logistic Regression : glm
 K-Means clustering : kmeans
 K-Nearest Neighbor Classification : knn
 Naive Bayes : naiveBayes
 Decision Tree : rpart
 Support Vector Machine : svm

Part 3 : (1) Visualization
Challenge
Processing other types of data (Big Data!) and how to show the results of Big Data analysis efficiently
 Tag cloud
 Visualization of text in the form of a tag cloud, i.e., a weighted visual list
 Words that appear most frequently are larger
 Helps the reader to quickly perceive the most salient concepts
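
The weighting behind a tag cloud is a word count mapped to a font size. The sketch below assumes a linear mapping between 10 pt and 40 pt and a very short stop-word list; both are illustrative choices, and the actual drawing is left to a plotting tool.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}   # illustrative list

def tag_cloud_sizes(text, min_pt=10, max_pt=40):
    """Map each word to a font size that grows linearly with its frequency."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    low, high = min(counts.values()), max(counts.values())
    span = max(high - low, 1)            # avoid dividing by zero
    return {
        word: min_pt + (count - low) * (max_pt - min_pt) // span
        for word, count in counts.items()
    }

text = "Big data is big. Data beats the algorithm, and data is everywhere."
sizes = tag_cloud_sizes(text)
print(sorted(sizes.items(), key=lambda item: -item[1]))
```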

 Clustergram
 Visualization technique used for cluster analysis
 Displays how individual members of a dataset are assigned to clusters as the number of clusters increases
 History flow
 Visualize the evolution of a document as it is edited by multiple authors
 Time appears on the horizontal axis
 Contributions to the text are on the vertical axis

 Spatial information flow
 Depicts spatial information flows
 Example : New York Talk Exchange
It shows the amount of Internet Protocol (IP) data flowing between NY and other cities
The size of the glow on a particular city location corresponds to the amount of IP traffic
Visualization allows us to determine quickly which cities are most closely connected to NY
