Platform for Data Scientists

datamantra

Platform for Data Scientists by Binu K, Architect, Analytics Platform, Subex

Data & Analytics

Platform for Data Scientists
Binu K, Architect Analytics Platform
www.subex.com

Data and Analytics
Capture
• Acquire, extract, parse, aggregate
Analyze
• Feature Engineering, Exploratory analysis
Modelling
• Machine learning, Statistics, Optimisation
Analytics Output
• Application to live data - Trends, Prediction
Communication of Results
• Dashboards and Reports
The process & pain areas
Time taken to turn data into insights – a few months
60 – 75%
Credits : Forbes

Advantages
Automate repeated routine jobs
• Data load
• Preprocessing
Maximum resource utilization
• Scheduling jobs overnight
Focus more on business
• Look at different use cases
• Solution areas
Integrated toolbox
• Combine tools into one environment

Expectations
Workbench
• Exploratory Data Analysis
• Advanced Modelling
• Distributed Architecture
Bespoke Algorithms
• Customized ML algorithms
• Custom Approaches
Industrialization
• Packaged Analytics Platform

Workbench
EDA
Querying capabilities
• Pointed queries
• Aggregations
• Partitioning
• Windowing
• Analytical functions
Descriptive Stats
• Univariate analysis
• Bivariate analysis
Predictive Modeling
• Building and testing
• Ensemble
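
The partitioning and windowing capabilities listed above can be illustrated with a small sketch. The slides do not name a query engine or syntax, so this uses plain Python to mimic a windowed running total; the record fields (`subscriber`, `day`, `minutes`) and the function name `running_total_by_partition` are hypothetical.

```python
from collections import defaultdict
from itertools import accumulate


def running_total_by_partition(rows, key, order, value):
    """Mimic SUM(value) OVER (PARTITION BY key ORDER BY order)."""
    # Group rows into partitions by the partition key.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)

    result = []
    for part_key in sorted(partitions):
        # Order rows inside the partition, then accumulate the value column.
        ordered = sorted(partitions[part_key], key=lambda r: r[order])
        totals = accumulate(r[value] for r in ordered)
        for row, total in zip(ordered, totals):
            result.append({**row, "running_total": total})
    return result


# Hypothetical call records: minutes used per subscriber per day.
calls = [
    {"subscriber": "A", "day": 2, "minutes": 5},
    {"subscriber": "B", "day": 1, "minutes": 7},
    {"subscriber": "A", "day": 1, "minutes": 10},
]

for row in running_total_by_partition(calls, "subscriber", "day", "minutes"):
    print(row)
```

Each subscriber forms its own partition, so the running total restarts at every new subscriber, which is the behaviour a window function with a partition clause would give.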

Customization
• Decision Trees/Random Forests
  • Handling categorical values
  • Identify top reason
  • Custom node labelling
• K-Means
  • Weighted Distance
  • Geospatial distance - Haversine distance
• Social Network Analysis
  • Build call network
  • Community detection
  • Influencer identification
Domain & scale
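
The Haversine distance mentioned under K-Means can be sketched as follows. This shows the distance function only; how the platform plugs it into clustering is not described in the slides. The function name `haversine_km` and the Earth radius constant are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the equator: from longitude 0 to longitude 90.
print(round(haversine_km(0.0, 0.0, 0.0, 90.0), 1))
```

Unlike a Euclidean distance on raw latitude and longitude, this accounts for the curvature of the Earth, which matters when comparing locations that are far apart.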

Objective
Pareto Analysis
Example
Selection of a limited subset which produces a significant overall effect. Two comparable metrics with unbalanced magnitudes of cause & effect are identified.
Samples
• Smartphones constitute 27% of all handsets but contribute to 95% of all mobile traffic
• 75% of the revenue is generated from 15% of distinct rate plans
• 10% of distinct problem areas are responsible for 83% of total complaints
Use cases
Can be used to identify the impact of a causal metric on an outcome metric.
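
A minimal sketch of the Pareto selection described above, assuming the causal metric is a set of distinct items (for example rate plans) and the outcome metric is a non-negative amount per item (for example revenue). The function name `pareto_subset`, the `target_share` parameter, and the sample figures are hypothetical and are not taken from the platform.

```python
def pareto_subset(contributions, target_share=0.8):
    """Pick the smallest set of top contributors whose combined share reaches target_share.

    contributions maps a cause (e.g. a rate plan) to its effect (e.g. revenue).
    Returns (selected causes, share of causes selected, share of effect covered).
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total effect must be positive")

    # Rank causes from largest to smallest effect.
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)

    selected, running = [], 0.0
    for cause, effect in ranked:
        if running / total >= target_share:
            break
        selected.append(cause)
        running += effect
    return selected, len(selected) / len(ranked), running / total


# Made-up revenue per rate plan, for illustration only.
revenue = {"plan_a": 500, "plan_b": 250, "plan_c": 100,
           "plan_d": 80, "plan_e": 40, "plan_f": 30}

plans, cause_share, effect_share = pareto_subset(revenue, target_share=0.75)
print(plans, round(cause_share, 2), round(effect_share, 2))
```

Taking causes in descending order of effect guarantees that no smaller subset reaches the same share, which is what makes the result a "limited subset with significant overall effect".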

Recipe for Success
Regardless of what some software vendor advertisements may claim, you can’t just purchase some analytics software, install it, sit back, and watch it solve all your problems.
The right combination of domain knowledge (business acumen) and analytics is required to solve any business problem.
“There is a tendency of solving one’s problems by means of much equipment rather than thought.” Alan Turing

ROC® Insights
Technologies
• Data Ingestion
• Data Storage
• Modelling/Profiler
• Reporting

Thank You
binu.k@subex.com

Platform for Data Scientists

2. Why Platform?

6. Workbench

8. Bespoke Algorithms

10. Packaged Analytics

12. ROC® Analytics & Insights Data Flow
Streaming & Batch Sources
• Structured: ROC FMS, ROC RA, ROC PS etc.
• Unstructured: Logs, Tweets, DPI, Mobile App, ERP etc.
Profiler
• Domain Guided Analytics
Analytical Engine
• Distributed ML and Statistical Techniques
Self Learning
• Continuous Feedback for Periodic Improvement
Signal Hub
• Domain and Analytical Inputs
Daily Profiles
• Profile for a day
Profile Manager
Master Profile
• Profile from many days
Pareto Analysis
Machine Learning & Statistics Libraries (MLlib, scikit-learn etc.)
AP4, AP2, AP5, AP3, many more…

16. Techomics Architecture

Editor's Notes

Majority of time taken is data cleansing. Reasons:
• The coding of the data is inconsistent (e.g. date is sometimes Day-Month-Year, and sometimes Month-Day-Year)
• Data is made available in separate tables, but merge keys for join are missing
• Dependent variables for the analysis are largely missing
• Many fields appear to contain wild (clearly impossible) values
• Ambiguity regarding whether a value is valid or missing (e.g. age is 99)
• The unit of observation in the data is not appropriate for analysis (e.g. transaction level data but analysis is required at customer level)
http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#57bc6a597f75
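
The first cleansing problem above (dates coded sometimes as Day-Month-Year and sometimes as Month-Day-Year) can be detected before any parsing is attempted. The sketch below is an illustration under stated assumptions, not part of the platform: the function name `classify_date_order` is hypothetical, dates are assumed to be dash-separated, and days are not validated against the length of each month.

```python
def classify_date_order(text):
    """Classify an 'NN-NN-YYYY' string as 'day-first', 'month-first', 'ambiguous' or 'invalid'."""
    parts = text.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return "invalid"

    first, second = int(parts[0]), int(parts[1])
    # Check which readings of the first two fields are possible at all.
    day_first_ok = 1 <= first <= 31 and 1 <= second <= 12
    month_first_ok = 1 <= first <= 12 and 1 <= second <= 31

    if day_first_ok and month_first_ok:
        return "ambiguous"
    if day_first_ok:
        return "day-first"
    if month_first_ok:
        return "month-first"
    return "invalid"


for sample in ["25-12-2016", "12-25-2016", "03-04-2016", "40-13-2016"]:
    print(sample, classify_date_order(sample))
```

Rows flagged as ambiguous cannot be resolved from the value alone; they need context such as the source system or neighbouring rows, which is part of why this step is time-consuming.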
Query on profile and raw table.
H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. H2O’s core code is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading. H2O’s REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP. The REST API is used by H2O’s web interface (Flow UI), R binding (H2O-R), and Python binding (H2O-Python).
Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. With Sparkling Water, users can drive computation from Scala/R/Python and utilize the H2O Flow UI, providing an ideal machine learning platform for application developers.
http://blog.cloudera.com/blog/2015/10/how-to-build-a-machine-learning-app-using-sparkling-water-and-apache-spark/
• Transform analytics insights to business insights
• Not just an algorithm; infused with business contexts
• Customized to telecom scale
Association:
• Both variables categorical: Cramér's V
• Categorical & continuous: simple linear regression with the categorical variable as the explanatory variable (one-way ANOVA)
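
For the categorical-to-categorical case named in this note, Cramér's V can be computed from a contingency table of counts. This is a minimal sketch of the standard, uncorrected statistic rather than the platform's implementation; the function name `cramers_v` is hypothetical, and the table is assumed to have at least two rows and two columns with no empty row or column.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by the sample size and the smaller table dimension minus one.
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


print(cramers_v([[10, 0], [0, 10]]))  # perfectly associated categories
print(cramers_v([[5, 5], [5, 5]]))    # independent categories
```

The value ranges from 0 (no association) to 1 (perfect association), which makes it comparable across tables of different sizes.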
The Pareto principle, named after economist Vilfredo Pareto, specifies an unequal relationship between inputs and outputs. It states that, for many events, roughly 80% of the effects come from 20% of the causes. ... Pareto developed both concepts in the context of the distribution of income and wealth among the population.
