Data Scientist Toolbox. Andrei Savu - Axemblr.com, BigData.ro 2013

My presentation at http://www.bigdata.ro/ on how to get your job done as a data scientist!

Published in: Technology

Transcript of "Data Scientist Toolbox"

  1. Data Scientist Toolbox. Andrei Savu - Axemblr.com, BigData.ro 2013
  2. Me
     • Founder of Axemblr.com
     • Organizer of Bucharest JUG (bjug.ro)
     • Passion for DevOps, Data Analysis
     • Connect with me on LinkedIn
  3. @ Axemblr
     • Service Deployment Orchestration
     • Infrastructure Automation (DevOps)
     • Apache Hadoop On-Demand Appliance
     • Axemblr Provisionr: https://github.com/axemblr/axemblr-provisionr
  4. (Big)Data in a nutshell
     • Business Intelligence / Research Evolved
     • Significant change in Decision Making
     • Enables new Products & Features
     • Enables new Business Models
  5. Data Scientist
     • Has a Business / Research oriented perspective
     • Knowledge of statistics & software engineering (AI, infrastructure)
     • Ability to explore questions and formulate hypotheses to be tested
  6. Data Science Project
     • Focused on particular business goals
     • Based on a set of important questions
     • Result: Answers that support business decisions
  7. The Algorithm
     • Find *Important* Questions
     • Identify & Extract Data
     • Store & Sample
     • Analyse
     • Visualization
     • Create Pipelines
     • Automate & Deploy
     • Learn & Repeat!
  8. Start w/ “Big” Questions ... answer them with (Big)Data
     • How can we understand & improve the conversion rate?
     • How can we increase customer satisfaction?
     • How can we find important mentions in social media?
  9. Identify Data Sources OR add more probes / sensors as needed
     • Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.
  10. Extract Data ... to a medium that allows you to run arbitrary queries
     • Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
  11. Extract
     • Database dump tool, replicas or backups
     • External web services
     • Apache Sqoop (SQL-to-Hadoop)
     • Implement pipelines / real-time streams
     • Write custom tools as needed
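
A minimal sketch of the "write custom tools as needed" option on the Extract slide, assuming the source is a SQL database and the target is a CSV file that other tools can query. The in-memory SQLite database, the `visits` table, and the `dump_table_to_csv` helper are hypothetical stand-ins, not something the talk prescribes.

```python
import csv
import io
import sqlite3


def dump_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV (header first); return the row count."""
    # Table names cannot be bound as SQL parameters, so accept simple identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical example data: an in-memory database standing in for a replica or backup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, page TEXT, converted INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "/home", 0), (2, "/pricing", 1), (3, "/home", 0)],
)

buffer = io.StringIO()
rows_written = dump_table_to_csv(conn, "visits", buffer)
csv_text = buffer.getvalue()
```

In practice the output would go to a file or to HDFS instead of a `StringIO` buffer; for large relational sources a dedicated tool such as Apache Sqoop is likely the better fit.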
  12. Curate: Unfortunately Data is Messy
  13. Curate - Your Way
     • Use or develop tools / scripts
     • On large volumes there are no obvious choices
     • Custom ways of filtering & aggregating large streams (e.g. Twitter, sensors)
     • Reuse existing software components for data curation / validation
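
One possible shape for a custom filtering and aggregation script, as mentioned on the Curate slide. This is only a sketch: the `sensor,value` line format and the helper names `clean_records` and `aggregate` are assumptions made for illustration.

```python
from collections import Counter


def clean_records(lines):
    """Yield (sensor, value) pairs, silently skipping malformed lines."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        sensor, raw = parts[0].strip().lower(), parts[1].strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # value is not numeric
        if sensor:
            yield sensor, value


def aggregate(records):
    """Return the mean value per sensor from a stream of (sensor, value) pairs."""
    totals, counts = Counter(), Counter()
    for sensor, value in records:
        totals[sensor] += value
        counts[sensor] += 1
    return {sensor: totals[sensor] / counts[sensor] for sensor in counts}


# Hypothetical messy input: mixed case, stray spaces, bad values, extra fields.
raw_stream = ["A, 1.0", "a,3.0", "b,oops", "", "B ,2.5", "c,1,2", ",4"]
averages = aggregate(clean_records(raw_stream))
```

Because `clean_records` is a generator, the same code works on a list, a file object, or a long-running stream without loading everything into memory.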
  14. DataWrangler: Interactive System for Data cleaning and transformation
     http://vis.stanford.edu/wrangler/
  15. Open Refine (formerly Google Refine)
     https://github.com/OpenRefine/OpenRefine
  16. Sample (time, etc.): As needed to support interactive exploration
  17. Why Sample?
     • Interactive exploration to create and check assumptions, to create algorithms
     • Be careful with “Statistical Significance”
     • Sample Smart: By time, By location etc.
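
A sketch of one way to "sample smart" by time: restrict events to a time window, then draw a fixed-size uniform sample with reservoir sampling (Algorithm R), which needs only one pass and constant memory. The `(timestamp, user)` event shape and both helper names are hypothetical; the talk does not name a specific sampling technique.

```python
import random


def in_time_window(events, start, end):
    """Lazily keep events whose timestamp falls in [start, end)."""
    return (event for event in events if start <= event[0] < end)


def reservoir_sample(stream, k, seed=0):
    """Return up to k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so exploration runs are repeatable
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (i + 1)
    return sample


# Hypothetical events: (timestamp, user) pairs.
events = [(t, f"user{t % 7}") for t in range(1000)]
sample = reservoir_sample(in_time_window(events, 100, 200), k=10)
```

A sample this small is fine for building intuition, but conclusions about significance still need to be checked against an adequately sized sample or the full data set.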
  18. Analyse Sample: This is where the fun begins
  19. Analyse Sample
     • Create models
     • Create algorithms
     • Check hypotheses
     • Faster feedback loops & Immediate Gratification
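
As an illustration of "check hypotheses" on a sample, here is a small permutation test for a difference in conversion rate between two groups. The choice of test, the 0/1 conversion encoding, and the group data are assumptions for this sketch; the slides do not specify a method.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, trials=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # pretend the group labels carry no information
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / trials


# Hypothetical sample: 1 = visitor converted, 0 = visitor did not convert.
control = [1] * 12 + [0] * 88
variant = [1] * 30 + [0] * 70
p_value = permutation_test(control, variant)
```

A small `p_value` suggests the observed difference would be unusual if both groups converted at the same rate. Running it on a sample keeps the feedback loop short before the same check is applied to all the data.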
  20. Excel-like
  21. Python
  22. RStudio
  23. Gephi.org
  24. Analyse All: apply your results to the entire data set
  25. How to Analyse All?
     • “Easy” on a single machine
     • Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.
     • Key: Leverage existing tools
     • Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
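
To make the "Custom MR" idea concrete, the sketch below runs the map, shuffle, and reduce phases of a MapReduce-style job in plain Python on a single machine. The `<ip> <page> <status>` log format and all function names are hypothetical; a real Hadoop job would distribute the same phases across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (page, 1) for each successful request in a '<ip> <page> <status>' line."""
    fields = line.split()
    if len(fields) == 3 and fields[2] == "200":
        yield fields[1], 1


def shuffle(pairs):
    """Group mapped values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical web server log lines.
log_lines = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /pricing 200",
    "10.0.0.1 /home 200",
    "10.0.0.3 /missing 404",
    "garbage",
]
hits = run_job(log_lines)
```

Keeping `map_phase` and `reduce_phase` free of shared state is what makes a job like this straightforward to move from one machine to a distributed runtime later.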
  26. Visualization: Communicate meaning w/ Graphics
  27. http://selection.datavisualization.ch/
  28. Automate & Deploy: Make it part of your internal dashboard
  29. Learn & Repeat: Answers, most of the time, generate new questions
  30. Thanks! Questions? Andrei Savu / asavu@axemblr.com, @andreisavu