Data Scientist Toolbox

My presentation at http://www.bigdata.ro/ on how to get your job done as a data scientist!

Published in: Technology

Transcript

  • 1. Data Scientist Toolbox Andrei Savu - Axemblr.com BigData.ro 2013
  • 2. Me
      • Founder of Axemblr.com
      • Organizer of Bucharest JUG (bjug.ro)
      • Passion for DevOps, Data Analysis
      • Connect with me on LinkedIn
  • 3. @ Axemblr
      • Service Deployment Orchestration
      • Infrastructure Automation (DevOps)
      • Apache Hadoop On-Demand Appliance
      • Axemblr Provisionr: https://github.com/axemblr/axemblr-provisionr
  • 4. (Big)Data in a nutshell
      • Business Intelligence / Research Evolved
      • Significant change in Decision Making
      • Enables new Products & Features
      • Enables new Business Models
  • 5. Data Scientist
      • Has a Business / Research oriented perspective
      • Knowledge of statistics & software engineering (AI, infrastructure)
      • Ability to explore questions and formulate hypotheses to be tested
  • 6. Data Science Project
      • Focused on particular business goals
      • Based on a set of important questions
      • Result: answers that support business decisions
  • 7. The Algorithm
      • Find *Important* Questions
      • Identify & Extract Data
      • Store & Sample
      • Analyse
      • Visualization
      • Create Pipelines
      • Automate & Deploy
      • Learn & Repeat!
  • 8. Start w/ “Big” Questions ... answer them with (Big)Data
      • How can we understand & improve the conversion rate?
      • How can we increase customer satisfaction?
      • How can we find important mentions in social media?
  • 9. Identify Data Sources, OR add more probes / sensors as needed
      • Google Analytics, web server logs, Mixpanel, custom application metrics, mouse tracking, Facebook metrics etc.
  • 10. Extract Data ... to a medium that allows you to run arbitrary queries
      • Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
  • 11. Extract
      • Database dump tool, replicas or backups
      • External web services
      • Apache Sqoop (SQL-to-Hadoop)
      • Implement pipelines / real-time streams
      • Write custom tools as needed
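The "write custom tools as needed" option from the Extract slide can be as small as a script that dumps one table to CSV. The sketch below is an illustration only, not something shown in the talk: it uses Python's standard `sqlite3` and `csv` modules, and the `orders` table, its columns, and the in-memory database are hypothetical stand-ins for a real replica or backup.

```python
import csv
import io
import sqlite3


def dump_table_to_csv(conn, table, out):
    """Write every row of `table` to the file-like `out` as CSV, header first.

    Returns the number of data rows written.
    """
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    rows = 0
    for row in cursor:  # iterate lazily instead of loading the whole table
        writer.writerow(row)
        rows += 1
    return rows


# Hypothetical in-memory database standing in for a read replica.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

buffer = io.StringIO()
exported = dump_table_to_csv(conn, "orders", buffer)
```

For data already headed to Hadoop, a dedicated tool such as Apache Sqoop (named on the slide) is likely the better fit; a script like this mainly helps for quick local exploration.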
  • 12. Curate: unfortunately, data is messy
  • 13. Curate - Your Way
      • Use or develop tools / scripts
      • On large volumes there are no obvious choices
      • Custom ways of filtering & aggregating large streams (e.g. Twitter, sensors)
      • Reuse existing software components for data curation / validation
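As one possible shape for the "tools / scripts" the Curate slide mentions, here is a minimal cleaning filter that works on a stream of records. The field names (`user`, `ts`, `amount`), the timestamp format, and the decimal-comma handling are all assumptions made for the example.

```python
from datetime import datetime


def clean_record(raw):
    """Return a normalised record, or None when the row cannot be repaired."""
    user = (raw.get("user") or "").strip().lower()
    if not user:
        return None  # a record without a user is useless for this analysis
    try:
        when = datetime.strptime((raw.get("ts") or "").strip(), "%Y-%m-%d %H:%M:%S")
        # Accept both "12.5" and "12,5" as decimal notations.
        amount = float(str(raw.get("amount", "")).replace(",", "."))
    except ValueError:
        return None
    return {"user": user, "ts": when, "amount": amount}


def curate(records):
    """Lazily yield only the records that survive cleaning.

    Being a generator, it never holds the full stream in memory.
    """
    for raw in records:
        cleaned = clean_record(raw)
        if cleaned is not None:
            yield cleaned


# Hypothetical messy input: stray whitespace, decimal comma, bad rows.
messy = [
    {"user": " Alice ", "ts": "2013-04-01 10:00:00", "amount": "12,5"},
    {"user": "", "ts": "2013-04-01 10:05:00", "amount": "1"},
    {"user": "bob", "ts": "not a date", "amount": "3"},
]
curated = list(curate(messy))
```

Dropping unrepairable rows silently is a design choice; in practice it may be worth counting or logging the rejects to see how messy the source really is.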
  • 14. DataWrangler: interactive system for data cleaning and transformation, http://vis.stanford.edu/wrangler/
  • 15. OpenRefine (formerly Google Refine): https://github.com/OpenRefine/OpenRefine
  • 16. Sample (time, etc.): as needed to support interactive exploration
  • 17. Why Sample?
      • Interactive exploration to create and check assumptions, to create algorithms
      • Be careful with “Statistical Significance”
      • Sample Smart: By time, By location etc.
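One way to read "Sample Smart: By time" is to keep a small random sample from every time bucket, so that quiet hours are not drowned out by busy ones. The sketch below runs reservoir sampling (Algorithm R) independently per bucket; the hourly bucketing and the fixed seed are assumptions chosen for the example, not something prescribed by the talk.

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_per_bucket(events, key, k, seed=0):
    """Keep at most k uniformly random events per bucket, in one pass.

    `key` maps an event to its bucket (an hour, a location, ...).
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for event in events:
        bucket = key(event)
        seen[bucket] += 1
        reservoir = reservoirs[bucket]
        if len(reservoir) < k:
            reservoir.append(event)
        else:
            # Keep the new event with probability k / seen[bucket].
            j = rng.randrange(seen[bucket])
            if j < k:
                reservoir[j] = event
    return dict(reservoirs)


# Hypothetical events: one per minute over three hours.
events = [datetime(2013, 4, 1, h, m) for h in range(3) for m in range(60)]
sample = sample_per_bucket(events, key=lambda e: e.hour, k=5)
```

Because each bucket is sampled at a different rate when bucket sizes differ, any estimate computed on the sample needs to be re-weighted before it is read as a statement about the full data set.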
  • 18. Analyse Sample: this is where the fun begins
  • 19. Analyse Sample
      • Create models
      • Create algorithms
      • Check hypotheses
      • Faster feedback loops & Immediate Gratification
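To make "check hypotheses" concrete for the conversion-rate question raised earlier in the deck, a two-proportion z-test is one common choice. This is a sketch under the usual normal-approximation assumptions (large, independent samples), and the visitor and conversion counts are invented for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Uses the pooled-variance normal approximation; it is undefined when the
    pooled rate is exactly 0 or 1 (division by zero).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical sample: variant A converts 100 of 1000, variant B 150 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value on a sample is a reason to look closer, not a final answer, which matches the deck's warning to be careful with "Statistical Significance".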
  • 20. Excel-like
  • 21. Python
  • 22. RStudio
  • 23. Gephi.org
  • 24. Analyse All: apply your results to the entire data set
  • 25. How to Analyse All?
      • “Easy” on a single machine
      • Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.
      • Key: Leverage existing tools
      • Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
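Before going distributed, the map / shuffle / reduce pattern behind "Custom MR" can be tried on a single machine. The sketch below is plain Python, not Hadoop code; the `"<path> <status>"` log format and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(lines, mapper):
    """Apply the mapper to every input line, yielding (key, value) pairs."""
    for line in lines:
        yield from mapper(line)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(lines, mapper, reducer):
    return reduce_phase(shuffle(map_phase(lines, mapper)), reducer)


def status_mapper(line):
    # Assumes a hypothetical log format "<path> <status>"; skips other lines.
    parts = line.split()
    if len(parts) == 2:
        yield parts[1], 1


def sum_reducer(key, values):
    return sum(values)


log_lines = ["/ 200", "/buy 200", "/missing 404", "garbage line here"]
counts = run_job(log_lines, status_mapper, sum_reducer)
```

Keeping the mapper and reducer as small pure functions makes it easier to move the same logic to a distributed framework later, in line with the slide's advice to leverage existing tools.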
  • 26. Visualization: communicate meaning w/ graphics
  • 27. http://selection.datavisualization.ch/
  • 28. Automate & Deploy: make it part of your internal dashboard
  • 29. Learn & Repeat: answers most of the time generate new questions
  • 30. Thanks! Questions? Andrei Savu / asavu@axemblr.com @andreisavu