Data Scientist Toolbox

     Andrei Savu - Axemblr.com
         BigData.ro 2013
Me

• Founder of Axemblr.com
• Organizer of Bucharest JUG (bjug.ro)
• Passion for DevOps, Data Analysis

• Connect with me on LinkedIn
@ Axemblr
• Service Deployment Orchestration
• Infrastructure Automation (DevOps)

• Apache Hadoop On-Demand Appliance
• Axemblr Provisionr
  https://github.com/axemblr/axemblr-provisionr
(Big)Data in a nutshell

• Business Intelligence / Research Evolved

• Significant change in Decision Making
• Enables new Products & Features
• Enables new Business Models
Data Scientist
• Has a business / research-oriented
  perspective
• Knowledge of statistics & software
  engineering (AI, infrastructure)


• Ability to explore questions and formulate
  hypotheses to be tested
Data Science Project

• Focused on particular business goals
• Based on a set of important questions

• Result > Answers that support business
  decisions
The Algorithm
• Find *Important* Questions

•   Identify & Extract Data

•   Store & Sample

•   Analyse

•   Visualization

•   Create Pipelines

•   Automate & Deploy

•   Learn & Repeat!
Start w/ “Big” Questions
          ... answer them with (Big)Data



How can we understand & improve the conversion rate?
     How can we increase customer satisfaction?
 How can we find important mentions in social media?
Identify Data Sources
      OR add more probes / sensors as needed



   Google Analytics, Web server logs, Mixpanel, Custom
application metrics, Mouse tracking, Facebook metrics etc.
Extract Data
... to a medium that allows you to run arbitrary queries



Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
Extract
• Database dump tool, replicas or backups
• External web services
• Apache Sqoop (SQL-to-Hadoop)

• Implement pipelines / real-time streams
• Write custom tools as needed
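A minimal sketch of the "database dump tool" bullet: exporting a table to CSV so it can be loaded into Hadoop or Hive later. The database path, `events` table, and column names are assumptions for illustration, not from the slides; a real job would use Sqoop or a vendor dump tool.

```python
import csv
import sqlite3

def dump_table_to_csv(db_path, table, out_path):
    """Dump every row of `table` to a CSV file with a header row."""
    conn = sqlite3.connect(db_path)
    try:
        # NOTE: table name is interpolated for brevity; validate it in real code.
        cursor = conn.execute("SELECT * FROM %s" % table)
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cursor.description])
            writer.writerows(cursor)
    finally:
        conn.close()
```

The CSV lands in a medium (local filesystem, HDFS) where arbitrary queries become possible, which is the whole point of the extract step.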
Curate
Unfortunately Data is Messy
Curate - Your Way
• Use or develop tools / scripts
• On large volumes there are no obvious choices
• Custom ways of filtering & aggregating large
  streams (e.g. twitter, sensors)
• Reuse existing software components for
  data curation / validation
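One possible shape for such a curation script, assuming hypothetical user records with `email` and `country` fields: normalize what can be normalized and silently drop rows that fail validation rather than aborting the whole stream.

```python
import re

# Simple email sanity check; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def curate(records):
    """Yield cleaned records; skip records with an invalid email."""
    for rec in records:
        email = rec.get("email", "").strip().lower()
        if not EMAIL_RE.match(email):
            continue  # filter out invalid rows instead of failing
        yield {
            "email": email,
            "country": rec.get("country", "").strip().upper() or "UNKNOWN",
        }
```

Because it is a generator, the same function works on a 10-line sample and on a multi-gigabyte stream.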
DataWrangler
  Interactive system for data
   cleaning and transformation


http://vis.stanford.edu/wrangler/
Open Refine
    Former Google Refine



https://github.com/OpenRefine/OpenRefine
Sample (time, etc.)
As needed to support interactive exploration
Why Sample?

• Interactive exploration to create and check
  assumptions, to create algorithms
• Be careful with “Statistical Significance”
• Sample Smart: By time, By location etc.
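A sketch of the "sample smart: by time" idea, under the assumption that events arrive as `(timestamp, payload)` tuples: instead of sampling uniformly (which can starve quiet hours), keep a fraction of events from every hourly bucket.

```python
import random

def sample_by_hour(events, fraction, seed=42):
    """Sample `fraction` of events per hour bucket; at least one per bucket."""
    rng = random.Random(seed)  # fixed seed -> reproducible sample
    buckets = {}
    for ts, payload in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, []).append((ts, payload))
    sample = []
    for bucket in buckets.values():
        k = max(1, int(len(bucket) * fraction))
        sample.extend(rng.sample(bucket, k))
    return sample
```

The fixed seed matters: a reproducible sample lets you re-check an assumption on exactly the same data later.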
Analyse Sample
 This is where the fun begins
Analyse Sample
• Create models
• Create algorithms
• Check hypotheses

• Faster feedback loops & Immediate
  Gratification
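As one example of checking a hypothesis on a sample, here is a two-proportion z-test sketch (not from the slides): did variant B convert better than variant A, or is the difference noise?

```python
import math

def z_score_two_proportions(conv_a, n_a, conv_b, n_b):
    """z-score for the difference of two conversion rates.
    |z| > 1.96 is roughly significant at the 5% level."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

This is where the "Statistical Significance" caveat from the sampling slide bites: on a too-small sample the z-score will rarely clear the bar.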
• Excel-like tools
• Python
• RStudio
• Gephi.org
Analyse All
apply your results to the entire data set
How to Analyse All?
• “Easy” on a single machine
• Go distributed w/ Hadoop, MPI, Storm,
  Oracle Exa* etc.
• Key: Leverage existing tools

• Tools: sed, awk, SQL, RHadoop, Apache
  Hive, Pig, Cloudera Impala, MPI, Custom MR
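A "Custom MR" job can be sketched in the Hadoop Streaming style: the same mapper/reducer pair runs locally through a pipe for testing, then unchanged on a cluster. Word count stands in for the real aggregation here.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (word, 1) for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Sum counts per word. Hadoop Streaming delivers mapper output
    sorted by key; we sort here to simulate that locally."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)
```

The key point from the slide holds: leverage existing tools — Hive or Pig expresses the same aggregation in a few lines of SQL-like script, so reach for custom MR only when they fall short.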
Visualization
Communicate meaning w/ Graphics
http://selection.datavisualization.ch/
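Not a real charting library, just a toy to make the point that even a crude picture of the aggregates communicates more than raw numbers; for real work pick a tool from the link above.

```python
def ascii_bars(counts, width=40):
    """Render {label: value} as horizontal ASCII bars, largest first."""
    peak = max(counts.values())
    lines = []
    for label, value in sorted(counts.items(), key=lambda kv: -kv[1]):
        bar = "#" * max(1, int(value / peak * width))
        lines.append("%-12s %s %d" % (label, bar, value))
    return "\n".join(lines)
```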
Automate & Deploy
 Make it part of your internal dashboard
Learn & Repeat
Answers, most of the time, generate new questions
Thanks! Questions?
  Andrei Savu / asavu@axemblr.com
            @andreisavu
