Data Scientist Toolbox

My presentation at http://www.bigdata.ro/ on how to get your job done as a data scientist!

Transcript

  • 1. Data Scientist Toolbox Andrei Savu - Axemblr.com BigData.ro 2013
  • 2. Me
      • Founder of Axemblr.com
      • Organizer of Bucharest JUG (bjug.ro)
      • Passion for DevOps, Data Analysis
      • Connect with me on LinkedIn
  • 3. @ Axemblr
      • Service Deployment Orchestration
      • Infrastructure Automation (DevOps)
      • Apache Hadoop On-Demand Appliance
      • Axemblr Provisionr https://github.com/axemblr/axemblr-provisionr
  • 4. (Big)Data in a nutshell
      • Business Intelligence / Research Evolved
      • Significant change in Decision Making
      • Enables new Products & Features
      • Enables new Business Models
  • 5. Data Scientist
      • Has a Business / Research oriented perspective
      • Knowledge of statistics & software engineering (AI, infrastructure)
      • Ability to explore questions and formulate hypotheses to be tested
  • 6. Data Science Project
      • Focused on particular business goals
      • Based on a set of important questions
      • Result > Answers that support business decisions
  • 7. The Algorithm
      • Find *Important* Questions
      • Identify & Extract Data
      • Store & Sample
      • Analyse
      • Visualization
      • Create Pipelines
      • Automate & Deploy
      • Learn & Repeat!
  • 8. Start w/ “Big” Questions ... answer them with (Big)Data
      • How can we understand & improve the conversion rate?
      • How can we increase customer satisfaction?
      • How can we find important mentions in social media?
  • 9. Identify Data Sources OR add more probes / sensors as needed
      Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.
  • 10. Extract Data ... to a medium that allows you to run arbitrary queries
      Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
  • 11. Extract
      • Database dump tool, replicas or backups
      • External web services
      • Apache Sqoop (SQL-to-Hadoop)
      • Implement pipelines / real-time streams
      • Write custom tools as needed
  • 12. Curate
      Unfortunately Data is Messy
  • 13. Curate - Your Way
      • Use or develop tools / scripts
      • On large volumes there are no obvious choices
      • Custom ways of filtering & aggregating large streams (e.g. Twitter, sensors)
      • Reuse existing software components for data curation / validation
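
The "custom ways of filtering & aggregating large streams" bullet on slide 13 could look roughly like the sketch below. It is not taken from the talk: the `user,action` line format, the `parse` and `count_actions` helpers, and the sample input are all assumptions made for illustration. The idea is to validate records one at a time with a generator, so messy lines are dropped without loading the whole stream into memory.

```python
from collections import Counter


def parse(lines):
    """Yield (user, action) pairs, skipping malformed lines.

    Assumes a hypothetical "user,action" format, one record per line.
    """
    for line in lines:
        parts = line.strip().split(",")
        # Drop records with the wrong number of fields or an empty field.
        if len(parts) != 2 or not all(parts):
            continue
        user, action = parts
        # Normalise whitespace and case so equal values compare equal.
        yield user.strip().lower(), action.strip().lower()


def count_actions(lines):
    """Aggregate the cleaned stream into a count per action."""
    counts = Counter()
    for _user, action in parse(lines):
        counts[action] += 1
    return counts


# Hypothetical messy input; any iterable of lines (file, socket) works.
raw = ["alice,click", "bob,CLICK", "garbage line", "carol,", " dave , purchase "]
counts = count_actions(raw)
print(counts)  # Counter({'click': 2, 'purchase': 1})
```

Because `parse` is a generator, the same code can be pointed at an open log file or a network stream instead of a list.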
  • 14. DataWrangler
      Interactive System for Data cleaning and transformation
      http://vis.stanford.edu/wrangler/
  • 15. Open Refine
      Formerly Google Refine
      https://github.com/OpenRefine/OpenRefine
  • 16. Sample (time, etc.)
      As needed to support interactive exploration
  • 17. Why Sample?
      • Interactive exploration to create and check assumptions, to create algorithms
      • Be careful with “Statistical Significance”
      • Sample Smart: By time, By location etc.
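
As a minimal sketch of "Sample Smart: By time" from slide 17, the snippet below first narrows a data set to a time window and then takes a reproducible random fraction of it for interactive exploration. The event structure (dicts with a `ts` field), the helper names `sample_by_time` and `sample_random`, and the generated data are assumptions for illustration, not part of the original deck.

```python
import random
from datetime import datetime, timedelta


def sample_by_time(events, start, end):
    """Keep only events whose timestamp falls in [start, end)."""
    return [e for e in events if start <= e["ts"] < end]


def sample_random(events, fraction, seed=0):
    """Keep roughly `fraction` of the events; the seed makes it repeatable."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < fraction]


# Hypothetical page-view events, one per minute over two days.
t0 = datetime(2013, 1, 1)
events = [
    {"ts": t0 + timedelta(minutes=i), "user": i % 50}
    for i in range(2 * 24 * 60)
]

# Restrict to the first day, then explore on about 10% of it.
first_day = sample_by_time(events, t0, t0 + timedelta(days=1))
explore = sample_random(first_day, 0.1)
```

Fixing the seed keeps the sample stable between runs, which makes it easier to compare hypotheses. Conclusions drawn from a small sample should still be checked against the full data set.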
  • 18. Analyse Sample
      This is where the fun begins
  • 19. Analyse Sample
      • Create models
      • Create algorithms
      • Check hypotheses
      • Faster feedback loops & Immediate Gratification
  • 20. Excel-like
  • 21. Python
  • 22. RStudio
  • 23. Gephi.org
  • 24. Analyse All
      Apply your results to the entire data set
  • 25. How to Analyse All?
      • “Easy” on a single machine
      • Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.
      • Key: Leverage existing tools
      • Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
  • 26. Visualization
      Communicate meaning w/ Graphics
  • 27. http://selection.datavisualization.ch/
  • 28. Automate & Deploy
      Make it part of your internal dashboard
  • 29. Learn & Repeat
      Answers, most of the time, generate new questions
  • 30. Thanks! Questions? Andrei Savu / asavu@axemblr.com @andreisavu