SlideShare a Scribd company logo
Overkill Analytics
Claudiu Barbura
VP of Engineering
• Architect and Dev Mgr at ubix.ai … data science platform
• Infrastructure & real-time services —-> Data Science at scale
• xPatterns Big Data Platform (Spark, Mesos, Tachyon, Cassandra)
• SeaScale my first ever meetup!
• Strata, Spark, C* summits & local meetups
About Me
• Ubix Data Eng & Science Platform Architecture
• High dimensional sparse feature spaces
• OKA (OverKill Analytics) and Composite Modelling
• (Kaggle)Outbrain Click Prediction: demo in DSL Workbench
• pymap deep dive: distributed scikit-learn through Spark
• python injection into DSL: pySpark scala JVM interop
• Q&A
Agenda
Data Eng & Science Platform: “Engine”
Unified big data technology stack (spark, cassandra, hadoop, kafka, es..)
Cloud agnostic architecture
Universal predictive interface (MlLib, ML Pipeline, VW, scikit-learn, R, H20 … TF)
Extensible and integration via fluent and expressive API (DSL)
Enterprise grade: scalability, performance, high availability, geo-replication,
resilience, security, manageability, interoperability, testability
• high dimensional feature engineering demanda sparse representation
• spark and scipy support vs ubix DSL: compress-sparse, merge-sparse, expand-
sparse, filter-sparse, load sparse (libsvm format)
• sparse vs dense: native input to mllib, spark.ml, scikit-learn algos
• exceptions: spark 1.6 mllib’s kmeans, gmm, RF (breeze linear algebra or … slow)
• feature (2-way) encoding + vocabulary extraction (error analysis, importance)
• Dimensionality Reduction via Feature Selection (ChiSquare) and Hashing (text)
High dimensional sparse feature spaces
• OKA: “design philosophy for predictive models favors volume over precision, utility over
elegance, and CPU over IQ. … brute force attack on data science, compromise fine-tuning
• Alternative to Dimensionality reduction - train on full sparse feature space!
• Composite Modeling = managing part models as one ensemble
• distributed scikit-learn/TF/VW models -> prediction table output for averaging, voting
• unsupervised learning output -> input supervised learning (clustering + ensembling)`
• dimensionality reduction or building semantically different models within clusters
• OKA + Comp: larger feature spaces (lower variance in parts -> higher bias in part models)
OKA (OverKill Analytics) & Composite Modelling
• Outbrain: content discovery platform … 250 billion personalized recommendations/month
• Kaggle: predict which recommended content each user will click?
• sample of users’ page views and clicks (14 days) .. sets of content recommendations
served to a specific user in a specific context +
• document metadata: mentioned entities (person, organization, location), a taxonomy of
categories, the topics mentioned, and the publisher.
• 2 Billion page views, 16,900,000 clicks of 700 Million unique users, across 560 sites
Outbrain Click Prediction
• primitives for model management (model + metadata)
• optimizations for clustering + composite modeling techniques
• compute partition size/count to avoid OOM (simple with static allocation of resources
(Mesos/Coarse Grained or YARN))
• wrapped pySpark (jvmContext) through gateway servercontext (JavaGateway)
• python-scala interop through cached temp tables (registerTempTable)
pymap - distributed python
Thank you!
Q & A

More Related Content

What's hot

Big Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the CloudBig Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the Cloud
George Ang
 
MSBIP møde nr. 25 - Azure ML
MSBIP møde nr. 25 - Azure MLMSBIP møde nr. 25 - Azure ML
MSBIP møde nr. 25 - Azure ML
David Bojsen
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoT
SingleStore
 
Microsoft Machine Learning Smackdown
Microsoft Machine Learning SmackdownMicrosoft Machine Learning Smackdown
Microsoft Machine Learning Smackdown
Lynn Langit
 
Analytics in the Cloud
Analytics in the CloudAnalytics in the Cloud
Analytics in the Cloud
Ross McNeely
 
El camino a las Cloud Native Apps - Azure AI
El camino a las Cloud Native Apps - Azure AIEl camino a las Cloud Native Apps - Azure AI
El camino a las Cloud Native Apps - Azure AI
Plain Concepts
 
Enterprise Cloud Batch Processing Compared
Enterprise Cloud Batch Processing ComparedEnterprise Cloud Batch Processing Compared
Enterprise Cloud Batch Processing Compared
CMPUTE
 
_Search? Made Simple: Elastic + App Search
_Search? Made Simple: Elastic + App Search_Search? Made Simple: Elastic + App Search
_Search? Made Simple: Elastic + App Search
Elasticsearch
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
Mark Kromer
 
Vaibhav's Resume
Vaibhav's ResumeVaibhav's Resume
Vaibhav's Resume
Vaibhav Sachdeva
 
Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904
Mark Tabladillo
 
MLflow on and inside Azure
MLflow on and inside AzureMLflow on and inside Azure
MLflow on and inside Azure
Databricks
 
Azure AI platform - Automated ML workshop
Azure AI platform - Automated ML workshopAzure AI platform - Automated ML workshop
Azure AI platform - Automated ML workshop
Parashar Shah
 
Microsoft Machine Learning Smackdown
Microsoft Machine Learning SmackdownMicrosoft Machine Learning Smackdown
Microsoft Machine Learning Smackdown
Lynn Langit
 
Tableau Data Sheet | Whitepaper
Tableau Data Sheet | WhitepaperTableau Data Sheet | Whitepaper
Tableau Data Sheet | Whitepaper
Vasu S
 
Community day ppt_kinesisv1.0
Community day ppt_kinesisv1.0Community day ppt_kinesisv1.0
Community day ppt_kinesisv1.0
Sridevi Murugayen
 
Finding new Customers using D&B and Excel Power Query
Finding new Customers using D&B and Excel Power QueryFinding new Customers using D&B and Excel Power Query
Finding new Customers using D&B and Excel Power Query
Lynn Langit
 
Using AML Python SDK
Using AML Python SDKUsing AML Python SDK
Using AML Python SDK
Microsoft Tech Community
 
Azure Machine Learning
Azure Machine LearningAzure Machine Learning
Azure Machine Learning
Dmitry Petukhov
 
Getting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIGetting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AI
Microsoft Tech Community
 

What's hot (20)

Big Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the CloudBig Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the Cloud
 
MSBIP møde nr. 25 - Azure ML
MSBIP møde nr. 25 - Azure MLMSBIP møde nr. 25 - Azure ML
MSBIP møde nr. 25 - Azure ML
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoT
 
Microsoft Machine Learning Smackdown
Microsoft Machine Learning SmackdownMicrosoft Machine Learning Smackdown
Microsoft Machine Learning Smackdown
 
Analytics in the Cloud
Analytics in the CloudAnalytics in the Cloud
Analytics in the Cloud
 
El camino a las Cloud Native Apps - Azure AI
El camino a las Cloud Native Apps - Azure AIEl camino a las Cloud Native Apps - Azure AI
El camino a las Cloud Native Apps - Azure AI
 
Enterprise Cloud Batch Processing Compared
Enterprise Cloud Batch Processing ComparedEnterprise Cloud Batch Processing Compared
Enterprise Cloud Batch Processing Compared
 
_Search? Made Simple: Elastic + App Search
_Search? Made Simple: Elastic + App Search_Search? Made Simple: Elastic + App Search
_Search? Made Simple: Elastic + App Search
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
 
Vaibhav's Resume
Vaibhav's ResumeVaibhav's Resume
Vaibhav's Resume
 
Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904
 
MLflow on and inside Azure
MLflow on and inside AzureMLflow on and inside Azure
MLflow on and inside Azure
 
Azure AI platform - Automated ML workshop
Azure AI platform - Automated ML workshopAzure AI platform - Automated ML workshop
Azure AI platform - Automated ML workshop
 
Microsoft Machine Learning Smackdown
Microsoft Machine Learning SmackdownMicrosoft Machine Learning Smackdown
Microsoft Machine Learning Smackdown
 
Tableau Data Sheet | Whitepaper
Tableau Data Sheet | WhitepaperTableau Data Sheet | Whitepaper
Tableau Data Sheet | Whitepaper
 
Community day ppt_kinesisv1.0
Community day ppt_kinesisv1.0Community day ppt_kinesisv1.0
Community day ppt_kinesisv1.0
 
Finding new Customers using D&B and Excel Power Query
Finding new Customers using D&B and Excel Power QueryFinding new Customers using D&B and Excel Power Query
Finding new Customers using D&B and Excel Power Query
 
Using AML Python SDK
Using AML Python SDKUsing AML Python SDK
Using AML Python SDK
 
Azure Machine Learning
Azure Machine LearningAzure Machine Learning
Azure Machine Learning
 
Getting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIGetting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AI
 

Viewers also liked

From Evergreen to Edible / Resource Listing
From Evergreen to Edible / Resource ListingFrom Evergreen to Edible / Resource Listing
From Evergreen to Edible / Resource Listing
Ramona Winkelbauer
 
Est10 evidencias cartas
Est10   evidencias cartasEst10   evidencias cartas
Est10 evidencias cartas
emmstone
 
Alkyd Resin
Alkyd ResinAlkyd Resin
Alkyd Resin
Amin Navabi
 
Gephi Tutorial Layouts
Gephi Tutorial LayoutsGephi Tutorial Layouts
Gephi Tutorial Layouts
Gephi Consortium
 
CQRS and EventSourcing
CQRS and EventSourcingCQRS and EventSourcing
CQRS and EventSourcing
DevOWL Meetup
 
Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012
jbellis
 
Gephi Tutorial Visualization
Gephi Tutorial VisualizationGephi Tutorial Visualization
Gephi Tutorial Visualization
Gephi Consortium
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
Adrian Cockcroft
 

Viewers also liked (8)

From Evergreen to Edible / Resource Listing
From Evergreen to Edible / Resource ListingFrom Evergreen to Edible / Resource Listing
From Evergreen to Edible / Resource Listing
 
Est10 evidencias cartas
Est10   evidencias cartasEst10   evidencias cartas
Est10 evidencias cartas
 
Alkyd Resin
Alkyd ResinAlkyd Resin
Alkyd Resin
 
Gephi Tutorial Layouts
Gephi Tutorial LayoutsGephi Tutorial Layouts
Gephi Tutorial Layouts
 
CQRS and EventSourcing
CQRS and EventSourcingCQRS and EventSourcing
CQRS and EventSourcing
 
Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012
 
Gephi Tutorial Visualization
Gephi Tutorial VisualizationGephi Tutorial Visualization
Gephi Tutorial Visualization
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 

Similar to Overkill Analytics

Bertenthal
BertenthalBertenthal
Bertenthal
Jesse Lingeman
 
DA_01_Intro.pptx
DA_01_Intro.pptxDA_01_Intro.pptx
DA_01_Intro.pptx
Alok Mohapatra
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
DATAVERSITY
 
MediaGlu and Mongo DB
MediaGlu and Mongo DBMediaGlu and Mongo DB
MediaGlu and Mongo DB
Sundar Nathikudi
 
Designing Artificial Intelligence
Designing Artificial IntelligenceDesigning Artificial Intelligence
Designing Artificial Intelligence
David Chou
 
ASGARD Splunk Conf 2016
ASGARD Splunk Conf 2016ASGARD Splunk Conf 2016
ASGARD Splunk Conf 2016
Keith Kraus
 
Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...
Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...
Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...
Lucas Jellema
 
Oow2016 review-13th october 2016
Oow2016 review-13th october 2016Oow2016 review-13th october 2016
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingClouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
Matthew Vaughn
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
Marco Quartulli
 
Science as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryScience as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate Discovery
Ian Foster
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
Mark Kerzner
 
Online news popularity analysis
Online news popularity analysisOnline news popularity analysis
Online news popularity analysis
Ankur Vora
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013
Kirill Osipov
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated ML
Mark Tabladillo
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
eRic Choo
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Spark Summit
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Microsoft Tech Community
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
Turi, Inc.
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQL
Crate.io
 

Similar to Overkill Analytics (20)

Bertenthal
BertenthalBertenthal
Bertenthal
 
DA_01_Intro.pptx
DA_01_Intro.pptxDA_01_Intro.pptx
DA_01_Intro.pptx
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
MediaGlu and Mongo DB
MediaGlu and Mongo DBMediaGlu and Mongo DB
MediaGlu and Mongo DB
 
Designing Artificial Intelligence
Designing Artificial IntelligenceDesigning Artificial Intelligence
Designing Artificial Intelligence
 
ASGARD Splunk Conf 2016
ASGARD Splunk Conf 2016ASGARD Splunk Conf 2016
ASGARD Splunk Conf 2016
 
Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...
Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...
Oracle OpenWorld 2016 Review - High Level Overview of major themes and grand ...
 
Oow2016 review-13th october 2016
Oow2016 review-13th october 2016Oow2016 review-13th october 2016
Oow2016 review-13th october 2016
 
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingClouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Science as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryScience as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate Discovery
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Online news popularity analysis
Online news popularity analysisOnline news popularity analysis
Online news popularity analysis
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated ML
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQL
 

More from Claudiu Barbura

Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internals
Claudiu Barbura
 
Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014
Claudiu Barbura
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
Claudiu Barbura
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
Claudiu Barbura
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
Claudiu Barbura
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
Claudiu Barbura
 

More from Claudiu Barbura (6)

Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internals
 
Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
 

Recently uploaded

20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 

Recently uploaded (20)

20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 

Overkill Analytics

  • 2. • Architect and Dev Mgr at ubix.ai … data science platform • Infrastructure & real-time services —-> Data Science at scale • xPatterns Big Data Platform (Spark, Mesos, Tachyon, Cassandra) • SeaScale my first ever meetup! • Strata, Spark, C* summits & local meetups About Me
  • 3. • Ubix Data Eng & Science Platform Architecture • High dimensional sparse feature spaces • OKA (OverKill Analytics) and Composite Modelling • (Kaggle)Outbrain Click Prediction: demo in DSL Workbench • pymap deep dive: distributed scikit-learn through Spark • python injection into DSL: pySpark scala JVM interop • Q&A Agenda
  • 4. Data Eng & Science Platform: “Engine” Unified big data technology stack (spark, cassandra, hadoop, kafka, es..) Cloud agnostic architecture Universal predictive interface (MlLib, ML Pipeline, VW, scikit-learn, R, H20 … TF) Extensible and integration via fluent and expressive API (DSL) Enterprise grade: scalability, performance, high availability, geo-replication, resilience, security, manageability, interoperability, testability
  • 5.
  • 6.
  • 7. • high dimensional feature engineering demanda sparse representation • spark and scipy support vs ubix DSL: compress-sparse, merge-sparse, expand- sparse, filter-sparse, load sparse (libsvm format) • sparse vs dense: native input to mllib, spark.ml, scikit-learn algos • exceptions: spark 1.6 mllib’s kmeans, gmm, RF (breeze linear algebra or … slow) • feature (2-way) encoding + vocabulary extraction (error analysis, importance) • Dimensionality Reduction via Feature Selection (ChiSquare) and Hashing (text) High dimensional sparse feature spaces
  • 8. • OKA: “design philosophy for predictive models favors volume over precision, utility over elegance, and CPU over IQ. … brute force attack on data science, compromise fine-tuning • Alternative to Dimensionality reduction - train on full sparse feature space! • Composite Modeling = managing part models as one ensemble • distributed scikit-learn/TF/VW models -> prediction table output for averaging, voting • unsupervised learning output -> input supervised learning (clustering + ensembling)` • dimensionality reduction or building semantically different models within clusters • OKA + Comp: larger feature spaces (lower variance in parts -> higher bias in part models) OKA (OverKill Analytics) & Composite Modelling
  • 9. • Outbrain: content discovery platform … 250 billion personalized recommendations/month • Kaggle: predict which recommended content each user will click? • sample of users’ page views and clicks (14 days) .. sets of content recommendations served to a specific user in a specific context + • document metadata: mentioned entities (person, organization, location), a taxonomy of categories, the topics mentioned, and the publisher. • 2 Billion page views, 16,900,000 clicks of 700 Million unique users, across 560 sites Outbrain Click Prediction
  • 10. • primitives for model management (model + metadata) • optimizations for clustering + composite modeling techniques • compute partition size/count to avoid OOM (simple with static allocation of resources (Mesos/Coarse Grained or YARN)) • wrapped pySpark (jvmContext) through gateway servercontext (JavaGateway) • python-scala interop through cached temp tables (registerTempTable) pymap - distributed python

Editor's Notes

  1. Logical Architecture Diagram
  2. Physical Architecture Diagram