Overkill Analytics

Claudiu Barbura
VP of Engineering

• Architect and Dev Mgr at ubix.ai … data science platform
• Infrastructure & real-time services -> Data Science at scale
• xPatterns Big Data Platform (Spark, Mesos, Tachyon, Cassandra)
• SeaScale: my first ever meetup!
• Strata, Spark, C* summits & local meetups
About Me

• Ubix Data Eng & Science Platform Architecture
• High dimensional sparse feature spaces
• OKA (OverKill Analytics) and Composite Modelling
• (Kaggle)Outbrain Click Prediction: demo in DSL Workbench
• pymap deep dive: distributed scikit-learn through Spark
• python injection into DSL: pySpark scala JVM interop
• Q&A
Agenda

Data Eng & Science Platform: “Engine”
Unified big data technology stack (spark, cassandra, hadoop, kafka, es..)
Cloud agnostic architecture
Universal predictive interface (MLlib, ML Pipeline, VW, scikit-learn, R, H2O … TF)
Extensibility and integration via a fluent, expressive API (DSL)
Enterprise grade: scalability, performance, high availability, geo-replication,
resilience, security, manageability, interoperability, testability

• high dimensional feature engineering demands a sparse representation
• spark and scipy support vs ubix DSL: compress-sparse, merge-sparse, expand-sparse,
filter-sparse, load sparse (libsvm format)
• sparse vs dense: native input to mllib, spark.ml, scikit-learn algos
• exceptions: spark 1.6 mllib’s kmeans, gmm, RF (breeze linear algebra or … slow)
• feature (2-way) encoding + vocabulary extraction (error analysis, importance)
• Dimensionality Reduction via Feature Selection (ChiSquare) and Hashing (text)
High dimensional sparse feature spaces
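
To make the sparse-representation bullets above concrete, here is a minimal sketch in plain Python of two of the ideas: reading one record in libsvm format into a sparse mapping, and reducing text tokens to a fixed number of dimensions with the hashing trick. The function names `parse_libsvm_line` and `hash_tokens` are illustrative assumptions, not the ubix DSL operators, and the MD5-based bucket function is a stand-in chosen for determinism (production hashers typically use a faster hash such as MurmurHash).

```python
import hashlib
from collections import defaultdict


def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val}).

    Only the non-zero entries are stored, which is what keeps a
    high dimensional feature space cheap to hold in memory.
    """
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features


def hash_tokens(tokens, n_buckets=2 ** 18):
    """Hashing trick: map each token to one of n_buckets indices.

    Colliding tokens share a bucket and their counts are summed, so the
    output dimension is fixed regardless of vocabulary size.
    (Assumption: MD5 is used here only as a deterministic stand-in.)
    """
    vec = defaultdict(float)
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1.0
    return dict(vec)


label, row = parse_libsvm_line("1 3:0.5 10:2")
text_features = hash_tokens("the quick brown fox jumps over the lazy dog".split())
```

Hashing gives up the reverse mapping from index to token, so it trades away the vocabulary extraction mentioned in the encoding bullet; that is one reason the two techniques are listed separately.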

• OKA: “design philosophy for predictive models favors volume over precision, utility over
elegance, and CPU over IQ. … brute force attack on data science, compromise fine-tuning”
• Alternative to Dimensionality reduction - train on full sparse feature space!
• Composite Modeling = managing part models as one ensemble
• distributed scikit-learn/TF/VW models -> prediction table output for averaging, voting
• unsupervised learning output -> input supervised learning (clustering + ensembling)
• dimensionality reduction or building semantically different models within clusters
• OKA + Comp: larger feature spaces (lower variance in parts -> higher bias in part models)
OKA (OverKill Analytics) & Composite Modelling
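
The "prediction table output for averaging, voting" idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the platform's implementation: the prediction table is modelled as a dict from a hypothetical part-model name to its per-row click probabilities, and the function names are invented for the example.

```python
from collections import Counter


def average_predictions(prediction_table):
    """Average the per-row probabilities of all part models."""
    columns = list(prediction_table.values())
    n_rows = len(columns[0])
    return [sum(col[i] for col in columns) / len(columns) for i in range(n_rows)]


def majority_vote(prediction_table, threshold=0.5):
    """Turn each part model's probability into a 0/1 vote and take the majority.

    With an even number of part models a tie is possible; Counter then
    returns the label seen first, so an odd model count is the safer choice.
    """
    columns = list(prediction_table.values())
    n_rows = len(columns[0])
    votes = []
    for i in range(n_rows):
        labels = [int(col[i] >= threshold) for col in columns]
        votes.append(Counter(labels).most_common(1)[0][0])
    return votes


# Hypothetical part models, each scoring the same two rows.
table = {
    "sklearn_part": [0.9, 0.2],
    "vw_part": [0.6, 0.4],
    "tf_part": [0.3, 0.6],
}
averaged = average_predictions(table)
voted = majority_vote(table)
```

Averaging keeps a probability that can be ranked, which suits click prediction; voting yields a hard label and discards how confident each part model was.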

• Outbrain: content discovery platform … 250 billion personalized recommendations/month
• Kaggle: predict which recommended content each user will click
• sample of users’ page views and clicks (14 days) .. sets of content recommendations
served to a specific user in a specific context +
• document metadata: mentioned entities (person, organization, location), a taxonomy of
categories, the topics mentioned, and the publisher.
• 2 Billion page views, 16,900,000 clicks of 700 Million unique users, across 560 sites
Outbrain Click Prediction

• primitives for model management (model + metadata)
• optimizations for clustering + composite modeling techniques
• compute partition size/count to avoid OOM (simple with static allocation of resources
(Mesos/Coarse Grained or YARN))
• wrapped pySpark (jvmContext) through gateway servercontext (JavaGateway)
• python-scala interop through cached temp tables (registerTempTable)
pymap - distributed python
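
The bullet on computing partition size and count to avoid OOM can be illustrated with a back-of-the-envelope calculation. The sketch below is an assumption about how such a computation might look, not the pymap code: `usable_fraction` (the share of executor memory available to task data) and `expansion_factor` (how much larger the data gets once deserialized into Python objects) are hypothetical tuning parameters that would need to be measured for a real workload.

```python
import math


def partition_count(dataset_bytes, executor_memory_bytes, cores_per_executor,
                    usable_fraction=0.5, expansion_factor=3.0):
    """Estimate how many partitions keep each task within its memory budget.

    With static resource allocation every executor runs up to
    cores_per_executor tasks at once, so each task gets an equal slice
    of the usable executor memory.
    """
    per_task_budget = executor_memory_bytes * usable_fraction / cores_per_executor
    in_memory_bytes = dataset_bytes * expansion_factor
    return max(1, math.ceil(in_memory_bytes / per_task_budget))


GIB = 2 ** 30
# Hypothetical cluster: 16 GiB executors with 4 cores each, 100 GiB of input.
n_partitions = partition_count(100 * GIB, 16 * GIB, 4)
```

The calculation is simple only because the per-executor memory and core counts are fixed up front, which is the point the slide makes about static allocation; with dynamically sized executors the per-task budget is not known in advance.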
