Data Scientist Toolbox

My presentation at http://www.bigdata.ro/ on how to get your job done as data scientist!

    Data Scientist Toolbox: Presentation Transcript

    • Data Scientist Toolbox
      • Andrei Savu - Axemblr.com
      • BigData.ro 2013
    • Me
      • Founder of Axemblr.com
      • Organizer of Bucharest JUG (bjug.ro)
      • Passion for DevOps, Data Analysis
      • Connect with me on LinkedIn
    • @ Axemblr
      • Service Deployment Orchestration
      • Infrastructure Automation (DevOps)
      • Apache Hadoop On-Demand Appliance
      • Axemblr Provisionr https://github.com/axemblr/axemblr-provisionr
    • (Big)Data in a nutshell
      • Business Intelligence / Research Evolved
      • Significant change in Decision Making
      • Enables new Products & Features
      • Enables new Business Models
    • Data Scientist
      • Has a Business / Research oriented perspective
      • Knowledge of statistics & software engineering (AI, infrastructure)
      • Ability to explore questions and formulate hypotheses to be tested
    • Data Science Project
      • Focused on particular business goals
      • Based on a set of important questions
      • Result > Answers that support business decisions
    • The Algorithm
      • Find *Important* Questions
      • Identify & Extract Data
      • Store & Sample
      • Analyse
      • Visualization
      • Create Pipelines
      • Automate & Deploy
      • Learn & Repeat!
    • Start w/ “Big” Questions ... answer them with (Big)Data
      • How can we understand & improve the conversion rate?
      • How can we increase customer satisfaction?
      • How can we find important mentions in social media?
    • Identify Data Sources OR add more probes / sensors as needed
      • Google Analytics, web server logs, Mixpanel, custom application metrics, mouse tracking, Facebook metrics etc.
    • Extract Data... to a medium that allows you to run arbitrary queries
      • Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
    • Extract
      • Database dump tool, replicas or backups
      • External web services
      • Apache Sqoop (SQL-to-Hadoop)
      • Implement pipelines / real-time streams
      • Write custom tools as needed
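
As an illustration of the "write custom tools as needed" point, here is a minimal sketch of a table dump to CSV, using only Python's standard library. The in-memory SQLite database, the `orders` table and the `dump_table_to_csv` helper are hypothetical stand-ins for a real replica or backup; they are not from the talk.

```python
import csv
import io
import sqlite3


def dump_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV, header first."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the database catalogue before interpolating it.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical in-memory database standing in for a production replica.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "ana", 19.5), (2, "dan", 7.0)])
buffer = io.StringIO()
rows_written = dump_table_to_csv(conn, "orders", buffer)
```

The resulting CSV can then be loaded into whichever medium is used for ad hoc queries. For large relational sources, a purpose-built tool such as Apache Sqoop is likely a better fit than a hand-rolled script.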
    • Curate: Unfortunately Data is Messy
    • Curate - Your Way
      • Use or develop tools / scripts
      • On large volumes there are no obvious choices
      • Custom ways of filtering & aggregating large streams (e.g. twitter, sensors)
      • Reuse existing software components for data curation / validation
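
A curation script of the kind mentioned above might look like the following sketch. The record layout (`user`, `ts`), the timestamp format and the `curate` function are assumptions made for illustration; real data will need its own validation rules.

```python
from datetime import datetime


def curate(records):
    """Yield cleaned records; drop those that fail validation."""
    seen = set()
    for rec in records:
        user = (rec.get("user") or "").strip().lower()
        raw_ts = (rec.get("ts") or "").strip()
        if not user or not raw_ts:
            continue  # missing required field
        try:
            ts = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # unparseable timestamp
        key = (user, ts)
        if key in seen:
            continue  # duplicate event after normalisation
        seen.add(key)
        yield {"user": user, "ts": ts}


# Hypothetical messy input.
messy = [
    {"user": " Ana ", "ts": "2013-04-18 10:00:00"},
    {"user": "ana", "ts": "2013-04-18 10:00:00"},   # duplicate after cleanup
    {"user": "", "ts": "2013-04-18 10:05:00"},      # missing user
    {"user": "dan", "ts": "18/04/2013"},            # wrong date format
    {"user": "Dan", "ts": "2013-04-18 11:30:00"},
]
clean = list(curate(messy))
```

Because `curate` is a generator, it processes one record at a time and can be applied to a stream. Note that the `seen` set grows with the number of distinct events, so on very large streams the duplicate check would need a bounded alternative.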
    • DataWrangler: Interactive system for data cleaning and transformation
      • http://vis.stanford.edu/wrangler/
    • OpenRefine: formerly Google Refine
      • https://github.com/OpenRefine/OpenRefine
    • Sample (time, etc.): As needed to support interactive exploration
    • Why Sample?
      • Interactive exploration to create and check assumptions, to create algorithms
      • Be careful with “Statistical Significance”
      • Sample Smart: By time, By location etc.
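
One possible reading of "sample smart, by time" is to cap the number of events kept per hour, so that busy periods do not drown out quiet ones. The sketch below assumes events arrive as `(timestamp, payload)` pairs; the `sample_by_hour` helper and the synthetic traffic are hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def sample_by_hour(events, per_hour, seed=0):
    """Keep at most `per_hour` randomly chosen events from each hour."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    buckets = defaultdict(list)
    for ts, payload in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((ts, payload))
    sample = []
    for hour in sorted(buckets):
        bucket = buckets[hour]
        sample.extend(rng.sample(bucket, min(per_hour, len(bucket))))
    return sample


# Hypothetical traffic: 100 events in the first hour, 3 in the second.
start = datetime(2013, 1, 1, 9, 0)
events = [(start + timedelta(seconds=30 * i), f"busy-{i}") for i in range(100)]
events += [(start + timedelta(hours=1, minutes=i), f"quiet-{i}")
           for i in range(3)]
sample = sample_by_hour(events, per_hour=5)
```

A sample built this way over-represents quiet hours relative to their share of traffic, which is one reason to be careful with statistical significance when drawing conclusions from it.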
    • Analyse Sample: This is where the fun begins
    • Analyse Sample
      • Create models
      • Create algorithms
      • Check hypotheses
      • Faster feedback loops & Immediate Gratification
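
To make "check hypotheses" concrete, here is a small sketch of a permutation test on conversion data, one of several ways to ask whether a difference seen in a sample could be due to chance. The two groups and the `permutation_p_value` helper are invented for illustration and are not from the talk.

```python
import random


def permutation_p_value(group_a, group_b, rounds=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(rounds):
        # Shuffle the labels: under the null hypothesis they do not matter.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / rounds


# Hypothetical sample: 1 = visitor converted, 0 = did not.
old_page = [1] * 20 + [0] * 180   # 10% conversion
new_page = [1] * 50 + [0] * 150   # 25% conversion
p_value = permutation_p_value(old_page, new_page)
```

A small `p_value` suggests the gap between the two pages is unlikely to come from random assignment alone. On a sample this size the loop runs in well under a second, which fits the fast feedback loop the slide describes.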
    • Excel-like
    • Python
    • RStudio
    • Gephi.org
    • Analyse All: apply your results to the entire data set
    • How to Analyse All?
      • “Easy” on a single machine
      • Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.
      • Key: Leverage existing tools
      • Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
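
The "custom MR" idea can be sketched on a single machine as a map step over chunks followed by a reduce step that merges partial results. The log format (`<path> <status>`) and all function names below are assumptions for illustration; this is not Hadoop's API.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count HTTP status codes within one chunk of log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical log lines in "<path> <status>" form.
log = ["/home 200", "/cart 200", "/buy 500", "/home 200", "/missing 404"]
partials = [map_chunk(chunk) for chunk in chunked(log, size=2)]
totals = reduce(reduce_counts, partials, Counter())
```

Because each chunk is processed independently and the merge is associative, the same two functions could in principle be handed to separate worker processes or to a distributed framework once the data no longer fits on one machine.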
    • Visualization: Communicate meaning w/ Graphics
    • http://selection.datavisualization.ch/
    • Automate & Deploy: Make it part of your internal dashboard
    • Learn & Repeat: Answers most of the time generate new questions
    • Thanks! Questions? Andrei Savu / asavu@axemblr.com @andreisavu