
Prototyping in the Data World - Data Scripting with Python



Talk at Hackbright Academy on September 9, 2014



About the Talk: Data is deeply integrated into decision making in modern companies. But businesses have difficulty understanding the glut of data they now create and consume. What is all this data, and how do we use it?
This talk will cover how to get started with data scripting, a powerful tool for prototyping data work. Data skills are crucial as engineering teams are responsible for discovering, housing, and extracting business-critical data. We’ll discuss how to find meaningful data, develop heuristics for normalizing and slicing data, and define useful data structures. We’ll use the Python data scientist’s favorite tools: numpy, pandas, and IPython. Finally, we’ll talk about data work in industry, and why your scripting skills will be a superpower in a data-rich world.

About the Speaker: Clare Corthell is a Data Scientist and Designer at Mattermark, a data-driven deal intelligence platform, where she builds technologies that quantify the growth of private companies. She is the originator of The Open Source Data Science Masters, a curriculum for learning Data Science. A Stanford-trained product designer and engineer, she's founded and worked with early-stage companies in the US, Europe, and East Africa. She’s up early pondering discovery algorithms, information design, diglossia, and education systems. Follow her at @clarecorthell.

Published in: Data & Analytics

  1. 1. Prototyping in the Data World. Clare Corthell, Data @ Mattermark, @clarecorthell
  2. 2. weirdest food you’ve ever eaten?
  3. 3. whale blubber ice cream, with blueberries. (mine)
  4. 4. about me: Open Source Data Science Masters, Mattermark, Data Scientist, Machine Learning Engineer
  5. 5. my company: Mattermark, a Private Company Deal Intelligence Platform (or, a huge spreadsheet full of live data about private companies, of which you can ask questions)
  6. 6. Today’s Goal
  7. 7. ask questions of data (what we think about all the time at Mattermark)
  8. 8. Why do we ask questions of our data?
  9. 9. Because we want to gain knowledge from data
  10. 10. @maebert
  11. 11. creating knowledge, understanding, & more data (after exploration)
  12. 12. What do Data Scientists do? Data Scientists turn data into knowledge by answering ambiguous questions such as: How do we bucket companies by industry? What are those industries? Can we predict whether someone will start a company? Are there patterns that computers can see that humans can’t?
  13. 13. What do Data Scientists do? Turning Data into Knowledge. How Data Scientists spend their time: • 80% on Cleaning > Munging > Exploration • 20% on Experiments / Analysis / Machine Learning. Exploration is important because it lets us determine what questions we might be able to answer with the data. Only then can we run experiments, analyze, and finally begin to fundamentally understand and model the world.
  14. 14. Exploration results in Prototypes (definition). When you explore and ask questions, you create a knowledge prototype (a first, probably incomplete version that leads to knowledge). Knowledge prototypes answer questions (they might not perfectly model the world, but they’re a useful start). Questions lead to more questions, and subsequently more knowledge.
  15. 15. “All prototypes are wrong, but some prototypes are useful.” — blatant misquoting of George E. P. Box
  16. 16. by exploring data we start to answer questions by building knowledge prototypes. lemme show you what I mean
  17. 17. What do we need to explore data? • Tools for working with that data (python!) • A data structure to make the data usable • Data • Questions we want to answer (we’ll make them up as we go today)
  18. 18. toolkit (python) • numpy: multi-dimensional container of data • pandas: data structures, analysis tools • IPython: browser-based code notebook / IDE (run blocks of code, not the whole program)
  19. 19. the data structure: DataFrame (and you thought you hated excel, but you actually don’t)
  20. 20. basic data structure: dataframe • records are rows • columns are values across those rows • basic actions: filtering, sorting, slicing (paradigmatically not a far cry from excel)
  21. 21. The Data (from Mattermark): company funding events in New York City from the last 5 years. Data types (examples): • Categorical (industry) • Continuous (uniques) • Binary (mobile app) • Dates (date of funding)
  22. 22. Initial Questions of Exploration • What’s in here? • Are there patterns? • What might we find out if we investigate further? Exploration
  23. 23. From questions come more questions And eventually, you find something very, very interesting (and probably valuable!)
  24. 24. What’s in here? (sample 10 rows) iPython code block pd.read_csv(csvfilename)
  25. 25. What’s in here? (sample 1 row) df.iloc[index_int]
  26. 26. What’s in here? (sample & describe 1 column) … df['colname'] df['colname'].describe()
  27. 27. What’s in here? (summary across columns) df.describe() (columns continue off the slide)
  28. 28. What’s in here? (sort by round size) … df.sort_values('colname', ascending=False)
  29. 29. What’s not in here? (null or missing values) In the column, is the value at a given index null? (true or false): df['colname'].isnull() … Count the number of null values in the column: len(np.where(df['colname'].isnull())[0])
  30. 30. Question: What is the most common stage for funding? to get a quick idea of scale… df['colname'].value_counts() df['colname'].value_counts().plot(kind='bar')
  31. 31. Leads to Question: What is the typical funding amount by round? Further questions: • What kind of companies raised at each stage? • How much variability is there in the amount raised at each stage? • Is this different from other geographies? groupby_var = df.groupby('colname'); print(groupby_var['colname'].mean().astype(int))
  32. 32. Question: How many of these are mobile companies? df.shape Further questions: • Do mobile companies have lots of employees? • Do mobile companies typically have revenue? • Do mobile companies raise less or more than other companies?
  33. 33. Question: How many of these are mobile companies? Further questions: • Do mobile companies have lots of employees? • Do mobile companies typically have revenue? • Do mobile companies raise less or more than other companies?
  34. 34. what we discovered. Our prototypes of knowledge: With regard to private companies in NYC that raised capital in the last 5 years: • ~10% have mobile applications • Most funding events were at the seed stage • The average seed round was $839k. In total: • There were 3209 reported funding events
  35. 35. Why it’s a prototype (i.e., why we’re not done yet) • The data isn’t completely clean • We haven’t accounted for null, missing, zero values • We haven’t connected directly to a business question • We aren’t working in production (just locally)
  36. 36. by exploring data we start to answer questions with knowledge prototypes
  37. 37. Why does this matter? • Exploration lets us build prototypes of knowledge that start to answer real questions. • One question paves the road to another. • Answering questions leads to knowledge. • People who have knowledge understand more about the world.
  38. 38. Why does this matter? There aren’t enough people that do this with code.
  39. 39. Why does this matter? People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines. - Zed Shaw (Learn Python the Hard Way)
  40. 40. daw.
  41. 41. Thank You! Best way to reach me? Twitter @clarecorthell psst — Mattermark is hiring! Come talk to me!
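
The exploration steps on slides 24 to 32 can be sketched end to end as below. This is a minimal sketch, not the code from the talk: the rows are made up for illustration (they are not Mattermark data), and the column names `company`, `stage`, `amount`, and `has_mobile_app` are hypothetical stand-ins for the `colname` placeholders on the slides. It assumes a recent pandas, where `sort_values` replaces the older `df.sort`.

```python
import io

import pandas as pd

# Made-up rows for illustration only; NOT Mattermark data.
# Column names are hypothetical stand-ins for 'colname' on the slides.
CSV_TEXT = """company,stage,amount,has_mobile_app
Acme,seed,500000,True
Bolt,seed,1200000,False
Cargo,a,,False
Delta,a,6000000,True
Echo,b,15000000,False
Fable,seed,800000,False
"""

# What's in here? read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))

first_rows = df.head(10)                 # sample up to 10 rows
one_row = df.iloc[0]                     # sample 1 row by integer position
amount_summary = df["amount"].describe() # describe 1 column
numeric_summary = df.describe()          # summary across numeric columns

# Sort by round size, largest first (missing amounts sort last by default).
by_size = df.sort_values("amount", ascending=False)

# What's not in here? Boolean mask of missing values, then a count.
is_missing = df["amount"].isnull()
missing_count = int(is_missing.sum())

# What is the most common stage for funding?
stage_counts = df["stage"].value_counts()
# stage_counts.plot(kind="bar") would chart this; it needs matplotlib.

# What is the typical funding amount by round?
# mean() skips missing values; astype(int) would fail if a whole
# group were missing, so this sketch relies on the toy data above.
mean_by_stage = df.groupby("stage")["amount"].mean().round().astype(int)

# How many of these are mobile companies?
row_count = df.shape[0]
mobile_count = int(df["has_mobile_app"].sum())
mobile_share = mobile_count / row_count

print(stage_counts)
print(mean_by_stage)
print(f"{mobile_count} of {row_count} rows have a mobile app")
```

Counting nulls with `.isnull().sum()` is an equivalent, shorter alternative to the `len(np.where(...)[0])` form on slide 29. In an IPython notebook each step would usually sit in its own cell so the intermediate tables can be inspected before asking the next question.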