Full-Stack Data Science: How to be a One-person Data Team
Talk I gave at CognitionX meet about my experience with doing data science in startups, such as Touch Surgery and Appear Here.

Published in: Data & Analytics

  1. Full-stack data science: how to be a one-person data team
     Greg Goltsov, Data Hacker
     gregory.goltsov.info | @gregoltsov
  2. 3+ years in startups. Built systems supporting 1mil+ users. Delivered to Fortune 10 companies.
     Greg Goltsov, Data Hacker
     gregory.goltsov.info | @gregoltsov
  3. Small/medium data. Python vs R. Concepts > code.
  4. CS + Physics · Games dev · Data analyst/engineer/viz/* · Data Scientist · Data Hacker
     (University → Touch Surgery → Appear Here)
  5. Web dev · Full-stack dev · DBA · SysAdmin · DevOps · Data Analyst · Data Engineer · Team Lead
  6. First contract, one-man team. Engineering → science. Learning. Fast.
  7. Invest in tools that last
  8. Basic: Postgres, Pandas, Scikit-learn, Notebooks, Luigi/Airflow, Bash, Git.
     Extra: Flask, AWS EC2, AWS Redshift, d3.js, ElasticSearch, Spark.
  9. Data is simple*
  10. SQL is simple
  11. SQL is everywhere
  12. If you can, use Postgres
  13. Postgres JSONB
      CREATE TABLE events (
        name varchar(200),
        visitor_id varchar(200),
        properties jsonb,
        browser jsonb
      );
      http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  14. “Hey, there’s MongoDB in my Postgres!” (same events table and link as slide 13)
  15. Postgres JSONB
      INSERT INTO events VALUES (
        'pageview', '1',
        '{ "page": "/account" }',
        '{ "name": "Chrome", "os": "Mac", "resolution": { "x": 1440, "y": 900 } }'
      );
      INSERT INTO events VALUES (
        'purchase', '5',
        '{ "amount": 10 }',
        '{ "name": "Firefox", "os": "Windows", "resolution": { "x": 1024, "y": 768 } }'
      );
  16. Postgres JSONB
      SELECT browser->>'name' AS browser,
             count(browser)
      FROM events
      GROUP BY browser->>'name';

       browser | count
      ---------+-------
       Firefox |     3
       Chrome  |     2
  17. Postgres CSV
      your_db=# \o 'path/to/export.csv'
      your_db=# COPY (
        SELECT *
        ...
      ) TO STDOUT WITH CSV HEADER;
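A CSV exported this way drops straight into pandas. A minimal sketch of the round trip; the inline CSV text and column names below are placeholders standing in for a real export file, not from the deck:

```python
import io
import pandas as pd

# Stand-in for the file written by \o + COPY ... WITH CSV HEADER;
# in practice you would pass the real path to pd.read_csv.
csv_text = "name,visitor_id\npageview,1\npurchase,5\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # two rows, two columns
```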
  18. Postgres WITH
      WITH new_users AS (SELECT ...),
           unverified_users_ids AS (SELECT ...)
      SELECT COUNT(new_users.id)
      FROM new_users
      WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);
  19. Postgres products
  20. Postgres products
  21. Pandas
  22. “R in Python”: DataFrame, simple I/O, plotting, split/apply/combine
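Split/apply/combine is one line in pandas. A toy frame for illustration, echoing the browser column from the Postgres slides (the data itself is made up):

```python
import pandas as pd

# Toy events, echoing the browser column from the Postgres slides
df = pd.DataFrame({
    "browser": ["Chrome", "Firefox", "Firefox", "Chrome", "Firefox"],
    "amount":  [10, 5, 7, 3, 2],
})

# Split by browser, apply aggregations, combine into one frame
summary = df.groupby("browser")["amount"].agg(["count", "sum"])
print(summary)
```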
  23. What vs How
      http://worrydream.com/LadderOfAbstraction
      # plain python
      col_C = []
      for i, row in enumerate(col_A):
          c = row + col_B[i]
          col_C.append(c)
      # pandas
      df['C'] = df['A'] + df['B']
  24. Like making data tidy
      http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says
  25. Scikit-learn
  26. Framework for rapid dev. Fit, transform, predict. NumPy.
  27. Analysis is a DAG
  28. get_data.sh && process_data.sh && publish_data.sh
  29. Luigi/Airflow: describe the pipeline declaratively
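What Luigi and Airflow buy you, in miniature: each task declares what it requires, and a scheduler walks the DAG. A stdlib-only toy mirroring the get/process/publish chain above; this is not the real Luigi or Airflow API:

```python
# Each task declares its requirements; a tiny scheduler runs
# dependencies first, memoising completed work (toy sketch only).
TASKS = {
    "get_data":     {"requires": [],               "run": lambda: "raw"},
    "process_data": {"requires": ["get_data"],     "run": lambda: "clean"},
    "publish_data": {"requires": ["process_data"], "run": lambda: "report"},
}

def run(name, done=None):
    done = {} if done is None else done
    if name not in done:
        for dep in TASKS[name]["requires"]:
            run(dep, done)          # recurse into requirements first
        done[name] = TASKS[name]["run"]()
    return done

results = run("publish_data")
print(list(results))  # dependency order: get_data, process_data, publish_data
```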
  30. Explore literally
  31. evaluation.py, domain.py
  32. evaluation.py, domain.py
  33. Start fast, iterate faster
  34. Rails new for data science
      Notebooks are for exploration, sane structure for collaboration
      https://drivendata.github.io/cookiecutter-data-science
  35. ├── Makefile           <- Makefile with commands like `make data` or `make train`
      ├── data
      │   ├── external       <- Data from third party sources.
      │   ├── interim        <- Intermediate data that has been transformed.
      │   ├── processed      <- The final, canonical data sets for modeling.
      │   └── raw            <- The original, immutable data dump.
      ├── docs               <- A default Sphinx project; see sphinx-doc.org for details
      ├── models             <- Trained and serialized models, model predictions
      ├── notebooks          <- Jupyter notebooks
      ├── references         <- Data dictionaries, manuals, and all other explanatory materials.
      ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
      ├── requirements.txt   <- The requirements file for reproducing the env
      └── src                <- Source code for use in this project.
          ├── data           <- Scripts to download or generate data
          │   └── make_dataset.py
          ├── features       <- Scripts to turn raw data into features for modeling
          │   └── build_features.py
          ├── models         <- Scripts to train models and then use trained models to make predictions
          │   ├── predict_model.py
          │   └── train_model.py
          └── visualization  <- Scripts to create exploratory and results oriented visualizations
              └── visualize.py
  36. Just write SQL
      www.dataset.readthedocs.io
      # connect, return rows as objects with attributes
      db = dataset.connect('postgresql://…', row_type=stuf)
      rows = db.query('SELECT country, COUNT(*) c FROM user GROUP BY country')
      # get data into pandas, that's where the fun begins!
      rows_df = pandas.DataFrame.from_records(rows)
  37. # sklearn-pandas
      mapper = DataFrameMapper([
          (['country'], [sklearn.preprocessing.Imputer(),
                         sklearn.preprocessing.StandardScaler()]),
          ...])
      # pipeline to convert DataFrame to ML representation
      pipeline = sklearn.pipeline.Pipeline([
          ('featurise', mapper),
          ('feature_selection', feature_selection.SelectKBest()),
          ('random_forest', ensemble.RandomForestClassifier())])
      # set up search space, then fit to find the best model
      cv = grid_search.GridSearchCV(pipeline, param_grid=dict(…))
      cv.fit(X, y)
      best_model = cv.best_estimator_
  38. Don’t guard, empower instead
      http://jhtart.deviantart.com/art/Castle-171525835
  39. “The goal is to turn data into information, and information into insight.” – Carly Fiorina, former HP CEO
  40. Steps forward
  41. www.github.com/caesar0301/awesome-public-datasets
      www.dataquest.io/blog/data-science-portfolio-project
  42. Meetups: meet, speak, give back
  43. Read, keep up: DataTau, DataScienceWeekly, O’Reilly Data Newsletter, KDnuggets, Kaggle
  44. Emacs, C++, games dev, photography
  45. Thanks! Greg Goltsov, Data Hacker
      gregory.goltsov.info | @gregoltsov
      Questions? Slides at bit.ly/gg-full-stack-data-science