(Big)Data in a nutshell
• Business Intelligence / Research Evolved
• Significant change in Decision Making
• Enables new Products & Features
• Enables new Business Models
Data Scientist
• Has a Business / Research oriented perspective
• Knowledge of statistics & software engineering (AI, infrastructure)
• Ability to explore questions and formulate hypotheses to be tested
Data Science Project
• Focused on particular business goals
• Based on a set of important questions
• Result: Answers that support business decisions
The Algorithm
• Find *Important* Questions
• Identify & Extract Data
• Store & Sample
• Analyse
• Visualization
• Create Pipelines
• Automate & Deploy
• Learn & Repeat!
Start w/ “Big” Questions
... answer them with (Big)Data
• How can we understand & improve the conversion rate?
• How can we increase customer satisfaction?
• How can we find important mentions in social media?
Identify Data Sources
OR add more probes / sensors as needed
Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.
Extract Data
... to a medium that allows you to run arbitrary queries
Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
Extract
• Database dump tool, replicas or backups
• External web services
• Apache Sqoop (SQL-to-Hadoop)
• Implement pipelines / real-time streams
• Write custom tools as needed
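As one illustration of "write custom tools as needed", the sketch below streams the result of a SQL query into newline-delimited JSON, a format most downstream tools can load. It is a minimal example under stated assumptions: the function name `dump_query_to_jsonl`, the `orders` table, and its rows are hypothetical, and an in-memory SQLite database stands in for a real replica or backup.

```python
import io
import json
import sqlite3


def dump_query_to_jsonl(conn, query, out, batch_size=1000):
    """Stream the rows of `query` to `out` as one JSON object per line."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    written = 0
    while True:
        # Fetch in batches so large tables never have to fit in memory.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            out.write(json.dumps(dict(zip(columns, row))) + "\n")
            written += 1
    return written


# Hypothetical demo data standing in for a production replica or backup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 19.5), (2, "bob", 5.0), (3, "alice", 42.0)],
)

buffer = io.StringIO()
count = dump_query_to_jsonl(conn, "SELECT * FROM orders ORDER BY id", buffer)
```

In practice `out` would be a file on the local filesystem or a staging area for HDFS; reading from a replica rather than the primary keeps the extraction from competing with production traffic.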
Curate - Your Way
• Use or develop tools / scripts
• On large volumes there are no obvious choices
• Custom ways of filtering & aggregating large streams (e.g. twitter, sensors)
• Reuse existing software components for data curation / validation
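Filtering and aggregating a stream can be sketched with plain Python generators, which process one record at a time and so work on inputs too large to hold in memory. The event shape (`user`, `ts`), the validity rule, and the 60-second window are assumptions made for this example, not something the slides prescribe.

```python
from collections import Counter


def valid_events(events):
    """Yield only events that carry a user and a numeric timestamp."""
    for event in events:
        if event.get("user") and isinstance(event.get("ts"), (int, float)):
            yield event


def count_per_window(events, window_seconds=60):
    """Count events per (window start, user) in a single pass."""
    counts = Counter()
    for event in events:
        window_start = int(event["ts"] // window_seconds) * window_seconds
        counts[(window_start, event["user"])] += 1
    return counts


# Hypothetical stream; in practice this would be tweets or sensor readings.
stream = [
    {"user": "alice", "ts": 5},
    {"user": "", "ts": 10},          # dropped: empty user
    {"user": "bob", "ts": "oops"},   # dropped: malformed timestamp
    {"user": "alice", "ts": 59},
    {"user": "alice", "ts": 61},
]
counts = count_per_window(valid_events(stream))
```

Keeping validation and aggregation as separate stages makes it easy to swap either one out, for instance to reuse an existing validation component.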
DataWrangler
Interactive System for Data cleaning and transformation
http://vis.stanford.edu/wrangler/
OpenRefine
Formerly Google Refine
https://github.com/OpenRefine/OpenRefine
Sample (time, etc.)
As needed to support interactive exploration
Why Sample?
• Interactive exploration to create and check assumptions, to create algorithms
• Be careful with “Statistical Significance”
• Sample Smart: By time, By location etc.
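One possible reading of "sample smart" is to draw the same fraction from every group (a time bucket, a location) so that small groups are not lost, which a plain random sample over the whole dataset can easily do. The sketch below assumes that interpretation; the function name `sample_by_group`, the `city` field, and the record counts are hypothetical.

```python
import random
from collections import defaultdict


def sample_by_group(records, key, fraction, seed=0):
    """Draw the same fraction from every group, keeping at least one record each."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical records: 100 from Berlin, 10 from Lisbon.
records = [{"city": "Berlin", "hour": i % 24} for i in range(100)]
records += [{"city": "Lisbon", "hour": i % 24} for i in range(10)]

sample = sample_by_group(records, key=lambda r: r["city"], fraction=0.1)
```

Passing `key=lambda r: r["hour"]` would stratify by time of day instead. A sample this small is meant for interactive exploration; conclusions drawn from it still need checking against the full data before any claim of statistical significance.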