Data Scientist Toolbox

Andrei Savu - Axemblr.com
BigData.ro 2013

Me

• Founder of Axemblr.com
• Organizer of Bucharest JUG (bjug.ro)
• Passion for DevOps, Data Analysis

• Connect with me on LinkedIn

@ Axemblr
• Service Deployment Orchestration
• Infrastructure Automation (DevOps)

• Apache Hadoop On-Demand Appliance
• Axemblr Provisionr
https://github.com/axemblr/axemblr-provisionr

(Big)Data in a nutshell

• Business Intelligence / Research Evolved

• Significant change in Decision Making
• Enables new Products & Features
• Enables new Business Models

Data Scientist
• Has a Business / Research oriented
perspective
• Knowledge of statistics & software
engineering (AI, infrastructure)

• Ability to explore questions and formulate
hypotheses to be tested

Data Science Project

• Focused on particular business goals
• Based on a set of important questions

• Result > Answers that support business
decisions

The Algorithm
• Find *Important* Questions
• Identify & Extract Data
• Store & Sample
• Analyse
• Visualization
• Create Pipelines
• Automate & Deploy
• Learn & Repeat!

Start w/ “Big” Questions
... answer them with (Big)Data

How can we understand & improve the conversion rate?
How can we increase customer satisfaction?
How can we find important mentions in social media?

Identify Data Sources
OR add more probes / sensors as needed

Google Analytics, Web server logs, Mixpanel, Custom
application metrics, Mouse tracking, Facebook metrics etc.

Extract Data
... to a medium that allows you to run arbitrary queries

Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig

Extract
• Database dump tool, replicas or backups
• External web services
• Apache Sqoop (SQL-to-Hadoop)

• Implement pipelines / real-time streams
• Write custom tools as needed
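
A custom extraction tool can be very small. The sketch below is only an illustration, assuming the source is a SQLite database and the target is a CSV file; the function name dump_query_to_csv, the table name, and the columns are hypothetical and not from the talk.

```python
import csv
import io
import sqlite3


def dump_query_to_csv(conn, query, out):
    """Run a read-only query and write the result, with a header row, as CSV.

    Returns the number of data rows written.
    """
    cursor = conn.execute(query)
    writer = csv.writer(out)
    # cursor.description holds one tuple per column; item 0 is the column name.
    writer.writerow([col[0] for col in cursor.description])
    rows_written = 0
    for row in cursor:
        writer.writerow(row)
        rows_written += 1
    return rows_written


def demo():
    # Hypothetical in-memory table standing in for a production replica.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user_id INTEGER, page TEXT)")
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [(1, "home"), (2, "pricing")],
    )
    out = io.StringIO()
    count = dump_query_to_csv(conn, "SELECT user_id, page FROM visits", out)
    conn.close()
    return count, out.getvalue()
```

Running the dump against a replica or a backup rather than the live database keeps the extraction from competing with production traffic.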

Curate
Unfortunately Data is Messy

Curate - Your Way
• Use or develop tools / scripts
• On large volumes there are no obvious choices
• Custom ways of filtering & aggregating large
streams (e.g. Twitter, sensors)
• Reuse existing software components for
data curation / validation
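
As a rough illustration of a curation script, the sketch below validates and normalises records one at a time, so it also works on a stream. The field names (user_id, amount) and the decimal-comma rule are assumptions made for the example.

```python
def curate(rows):
    """Split raw dict records into (clean, rejected).

    A record is kept when it has a non-empty user_id and a numeric amount.
    Assumption: amounts may use a decimal comma ("1,5"), which is normalised
    to a dot; anything that still fails to parse is rejected, not guessed.
    """
    clean, rejected = [], []
    for row in rows:
        user_id = (row.get("user_id") or "").strip()
        raw_amount = (row.get("amount") or "").strip().replace(",", ".")
        try:
            amount = float(raw_amount)
        except ValueError:
            rejected.append(row)
            continue
        if not user_id:
            rejected.append(row)
            continue
        clean.append({"user_id": user_id, "amount": amount})
    return clean, rejected
```

Keeping the rejected records, instead of silently dropping them, makes it possible to measure how messy the source is and to refine the rules later.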

DataWrangler
Interactive System for Data
cleaning and transformation

http://vis.stanford.edu/wrangler/

OpenRefine
Formerly Google Refine

https://github.com/OpenRefine/OpenRefine

Sample (time, etc.)
As needed to support interactive exploration

Why Sample?

• Interactive exploration to create and check
assumptions, to create algorithms
• Be careful with “Statistical Significance”
• Sample Smart: By time, By location etc.
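
One possible way to sample by time is sketched below: restrict the records to a time window first, then keep a random fraction of what is left. The record layout (a dict with a "ts" datetime) and the function name are assumptions for illustration.

```python
import random
from datetime import datetime


def sample_by_time(records, start, end, fraction=1.0, seed=0):
    """Keep records with start <= ts < end, then keep a random fraction.

    A fixed seed makes the sample reproducible between exploration sessions.
    """
    rng = random.Random(seed)
    in_window = [r for r in records if start <= r["ts"] < end]
    # rng.random() is in [0, 1), so fraction=1.0 keeps every record.
    return [r for r in in_window if rng.random() < fraction]


# Hypothetical events, one per day for the first ten days of March 2013.
events = [{"id": day, "ts": datetime(2013, 3, day)} for day in range(1, 11)]
first_week = sample_by_time(
    events, datetime(2013, 3, 1), datetime(2013, 3, 8), fraction=1.0
)
```

Sampling a contiguous window keeps sessions and sequences intact, which a purely random sample over the whole data set would break up.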

Analyse Sample
This is where the fun begins

Analyse Sample
• Create models
• Create algorithms
• Check hypotheses

• Faster feedback loops & Immediate
Gratification
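
As one example of checking a hypothesis on a sample, the sketch below compares two conversion rates with a two-proportion z-test. The test itself is a standard technique chosen for illustration, not something the talk prescribes, and the visitor counts are made up.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_* are conversion counts, n_* are visitor counts.
    Returns (z, p_value). Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical sample: variant A converts 50 of 1000, variant B 80 of 1000.
z, p_value = two_proportion_z(50, 1000, 80, 1000)
```

A small p-value on a sample is a reason to look closer, not a final answer; the result still has to hold on the entire data set.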

Analyse All
apply your results to the entire data set

How to Analyse All?
• “Easy” on a single machine
• Go distributed w/ Hadoop, MPI, Storm,
Oracle Exa* etc.
• Key: Leverage existing tools

• Tools: sed, awk, SQL, RHadoop, Apache
Hive, Pig, Cloudera Impala, MPI, Custom MR
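
The single-machine case and a custom MapReduce job share the same shape: compute partial results per chunk, then merge them. The sketch below shows that shape in plain Python, assuming whitespace-separated log lines whose first field is the key to count; it is not tied to the Hadoop API.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count occurrences of the first field in one chunk of lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[parts[0]] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counters into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical chunks, e.g. one per log file or per worker.
chunks = [
    ["/home 200", "/pricing 200", "/home 500"],
    ["/home 200", "", "/signup 302"],
]
totals = reduce_counts(map_chunk(chunk) for chunk in chunks)
```

Because the merge is associative, the chunks can be processed in any order or on different machines, which is what makes the same logic portable to a distributed tool.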

Visualization
Communicate meaning w/ Graphics

http://selection.datavisualization.ch/

Automate & Deploy
Make it part of your internal dashboard

Learn & Repeat
Answers usually generate new questions

Thanks! Questions?
Andrei Savu / asavu@axemblr.com
@andreisavu
