My Data Journey with Python
Wes McKinney @wesmckinn
SciPy 2015 Keynote, 2015-07-09
© Cloudera, Inc. All rights reserved.
  
Who am I?
  
This talk
•  2007-present, from my perspective
•  Celebrating our successes
•  Challenges and opportunities for the future
  
Why are we all here?
  
My pre-2007 existence
•  I was a mathematician!
•  No exposure to Python, SQL, R (or any analytics for that matter)
•  Rude awakening ahead
  
My first job: AQR (quant hedge fund)
•  A quant finance operation that lived and breathed SQL and Excel
•  Production systems in C++, Java, Visual BASIC, and C# .NET
•  Some PhD-level researchers used MATLAB for research (as was common in finance / economics departments)
  
Productivity frustrations
•  First year: several analytics and statistical data analysis projects
    • A huge amount of SQL
    • Some Java
    • A little bit of R
    • … and TONS of Excel
•  Projects felt like 5% conceptualization, 95% tedium
  
Python in early 2008: different times
•  A bleeding edge stack
    • NumPy 1.0.4
    • SciPy 0.6.0
    • matplotlib 0.91.2
    • IPython 0.8.4, SVN history begins 2/2008
    • Cython 0.9.8
•  The scientific Python community seemed mainly focused on attracting MATLAB, HPC, and scientific lab users
  
2008: Things SciPythonistas didn’t care too much about
•  Relational data or SQL
•  Missing data handling (outside numpy.ma)
•  Statistics and econometrics (first statsmodels release: 2011)
•  Statistical graphics
•  Machine learning (scikit-learn 0.1: 2/2010)
•  Analytics and business intelligence
  
Taking a gamble
•  Decided to give Python a shot for AQR projects after seeing part of MASS R package ported in scipy.stats.models by Jonathan Taylor at Stanford
•  proto-pandas first version built in April 2008
    • Focused on porting an R project to Python
•  May ‘08: Embedded Python interpreter in a legacy C++ system
•  5/2008 – 12/2008: Skunkworks Python ports and evangelism across company
  
Why did Python work out?
•  Batteries included
•  Interoperability with C++
    • Embedding Python interpreter
    • Wrapping C++ in Python C extensions
•  Productive user interface
    • Python language
    • IPython + matplotlib
  
Some other cool things we built
•  A global macro risk modeling system (using pandas + NumPy + PyTables)
•  A heterogeneous market data loading and cleaning system
•  A task-based cluster computing system (similar to Celery)
•  Tick data storage and analytics
•  Various GUIs with wxPython + matplotlib
  
End 2009: pandas!
•  AQR lets me open source pandas 0.1 on Christmas, 2009.

~/Downloads/pandas-0.1 $ cloc --exclude-ext pandas

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          41           3124           2933           8225
Cython                           7            418             93           1247
C/C++ Header                     1              0              0              1
-------------------------------------------------------------------------------
SUM:                            49           3542           3026           9473
-------------------------------------------------------------------------------
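The SUM row of the cloc output can be checked with a few lines of Python. This is only a sketch: the row values are copied from the table, while the dictionary layout and the "Python share" figure are illustrative additions that do not come from the talk.

```python
# Rows copied from the cloc table: (files, blank, comment, code).
rows = {
    "Python": (41, 3124, 2933, 8225),
    "Cython": (7, 418, 93, 1247),
    "C/C++ Header": (1, 0, 0, 1),
}

# Column-wise sums should reproduce the SUM row of the table.
totals = tuple(sum(column) for column in zip(*rows.values()))

# Share of code lines written in Python (illustrative metric, not from the talk).
python_share = rows["Python"][3] / totals[3]

print(totals)
print(f"Python share of code lines: {python_share:.1%}")
```

The column sums match the SUM row (49 files, 3542 blank, 3026 comment, 9473 code lines), so the reconstructed table is internally consistent.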
  
2010 – 2011: Python’s data growing pains
•  pandas did not evolve much after its initial release
•  No consensus or momentum behind any project for analytics / data wrangling
•  AQR → Duke Statistical Science
•  AQR sponsors bug fixes and new features in pandas
  
May 2011: Getting inspired
•  2011-05-13: Enthought Datarray Summit
    • Discuss how to enable Python to become more useful for statistical computing
    • Me: “Library fragmentation is destructive; integration is better”
    • Data structures, missing data, and data wrangling tools
•  2011-05-23 – 2011-06-03: Python finance consulting engagement
    • Realized that Python data tools were sorely needed in industry
    • But not nearly mature enough yet
  
2011-05-30
  
Tell me about your use cases
  
Making pandas a better tool
•  Consulting at AppNexus (NYC ad tech company) opened eyes to new problems
•  June 2011 – December 2012
    • Fix some pandas design issues
    • Build out data wrangling capabilities (hierarchical indexes, etc.)
    • Create “killer apps” (time series capabilities)
    • Evangelize and collaborate with other projects
  
Taking advantage of temporary financial freedom
  
Making a book happen
•  A chicken-and-egg problem
•  Fernando Pérez, Brian Granger, and John Hunter had been toying with the idea of a “SciPy Book” for a couple years
•  Decided to forge my own path in Nov 2011
    • Writing took about 9 months
    • Helped motivate me to “finish” parts of pandas
•  ~ 50,000 copies in circulation
  
Clarity and software engineering
•  Progress in software not just about hard work
•  Solving the right problems
    • … in the right order
    • … while wasting little time/energy on non-impactful issues
    • … while being faced with real world concerns (80/20 rule)
•  Taking the time to develop a clear vision and scope for a project is a major factor in its success or failure
  
It took a village
•  Fernando Pérez & Brian Granger (IPython)
•  Skipper Seabold & Josef Perktold (statsmodels)
•  Eric Jones (Enthought)
•  Travis Oliphant & Peter Wang (Enthought & Continuum)
•  John Hunter (matplotlib)
•  … and many others
  
An unlikely train ride

SEA → PDX
November 18, 2011
  
Seatmate: “Are you a programmer?”

(he saw my Emacs buffers)
  
Wes: “Yeah, I do Python”
  
Seatmate: “Oh, I do a bit of Python too”
  
Wes: “Cool, well, there’s this awesome new thing called the IPython notebook”
  
My seatmate was a computational bio professor and 5-year PSF member:
Titus Brown
  
And he would later assist the IPython team in their Sloan Foundation $1mm grant in 2012
  
Some words about John Hunter (1968 – 2012)
  
Business ventures 2012 - 2014
•  2012: Lambda Foundry
    • Support and develop pandas
    • Explored creating a commercial Python financial toolkit
•  2013 – 2014: DataPad
    • “Google Drive for Analytics / BI”
    • With Chang She (MIT → AQR → pandas)
    • Silicon Valley VC-backed
    • Acquired by Cloudera in September 2014
  
Cloudera
•  Sort of “the Red Hat of Big Data”
•  The leading open source Hadoop platform
•  Supporting and developing a little over 20 Apache-licensed open source projects
•  A dream job
    • Full time open source development
    • Solving hard data problems faced by the world’s largest companies
•  P.S. we’re hiring engineers in Austin + Bay Area
  
What I’m interested in right now
•  Ways to enable collaboration on data tools across programming languages
•  Domain specific language design and compilation
•  Improving the Python-on-Hadoop experience
•  LLVM + Code generation
  
Different kinds of Big Data
•  Python programmers have been dealing with big scientific data in HPC settings for years
•  Big…
    • Text data
    • Homogeneous array data
    • Tabular (structured) data
    • JSON-like (semi-structured) data
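As a rough illustration of the four shapes listed above, the same small set of measurements can be held as text, as a homogeneous array, as a table, or as JSON-like records. The sample values and variable names below are invented for the example, and only the Python standard library is used; real workloads would typically use dedicated libraries.

```python
import array
import csv
import io
import json

# Text data: unstructured characters that must be parsed before use.
text = "sensor a read 1.5; sensor b read 2.5"

# Homogeneous array data: every element has one fixed type (here C doubles).
values = array.array("d", [1.5, 2.5])

# Tabular (structured) data: named columns, one record per row.
table_src = "sensor,reading\na,1.5\nb,2.5\n"
table = list(csv.DictReader(io.StringIO(table_src)))

# JSON-like (semi-structured) data: nested, and fields may differ per record.
records = json.loads(
    '[{"sensor": "a", "reading": 1.5},'
    ' {"sensor": "b", "reading": 2.5, "tags": ["lab"]}]'
)

print(len(text), values.typecode, table[0]["reading"], sorted(records[1]))
```

Note that the CSV reader returns every field as a string, while the array and the JSON records carry numeric types. Differences like this are part of why each kind of data tends to need its own tooling.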
  
The Great Data Tool Decoupling™
•  Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
•  In fact, you should really want this to happen
    • Share systems among languages
    • Reduce fragmentation and “lock-in”
    • Shift developer focus to usability
•  Prediction: we’ll be there by 2025; sooner if we all get our act together
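One way to picture the thesis is a user-facing function that depends only on narrow storage and execution interfaces, so either side can be swapped without touching the other. The sketch below is an assumption-laden toy: `Storage`, `Engine`, `InMemoryStorage`, `LoopEngine`, and `column_total` are hypothetical names made up for illustration and do not correspond to any real project mentioned in the talk.

```python
from __future__ import annotations

from typing import Iterable, Protocol


class Storage(Protocol):
    """Hypothetical storage interface: hands back rows, nothing more."""

    def scan(self) -> Iterable[dict]: ...


class Engine(Protocol):
    """Hypothetical execution interface: aggregates rows from any Storage."""

    def total(self, rows: Iterable[dict], column: str) -> float: ...


class InMemoryStorage:
    """Toy storage backend that keeps rows in a Python list."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def scan(self) -> Iterable[dict]:
        return iter(self._rows)


class LoopEngine:
    """Toy execution engine that sums a column with a plain loop."""

    def total(self, rows: Iterable[dict], column: str) -> float:
        return float(sum(row[column] for row in rows))


def column_total(storage: Storage, engine: Engine, column: str) -> float:
    """User-facing entry point: knows only the two interfaces above."""
    return engine.total(storage.scan(), column)


store = InMemoryStorage([{"qty": 2}, {"qty": 3}])
result = column_total(store, LoopEngine(), "qty")
print(result)
```

Because `column_total` never names a concrete class, a different storage backend or a faster engine could be dropped in without changing the user interface, which is the kind of specialization the slide argues for.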
  
Thank you

@wesmckinn
  

More Related Content

What's hot

Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaWes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphP. Taylor Goetz
 

What's hot (19)

Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 

Viewers also liked

pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWes McKinney
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsWes McKinney
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageWes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
 

Viewers also liked (12)

pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
 
Hacking
HackingHacking
Hacking
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
Основы MATLAB. Лекция 1.
Основы MATLAB. Лекция 1.Основы MATLAB. Лекция 1.
Основы MATLAB. Лекция 1.
 

Similar to My Data Journey with Python (SciPy 2015 Keynote)

RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...Databricks
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analyticsSouth West Data Meetup
 
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...ScalrCMP
 
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...NebulaInc
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
 
Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...Varad Meru
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
OpenStack Journey in Tieto Elastic Cloud
OpenStack Journey in Tieto Elastic CloudOpenStack Journey in Tieto Elastic Cloud
OpenStack Journey in Tieto Elastic CloudJakub Pavlik
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightPrecisely
 
Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Rogue Wave Software
 
Open Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesOpen Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesAll Things Open
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet CodeDavid Danzilio
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSteven Totman
 
Market trends in IT - exchange cala - October 2015
Market trends in IT - exchange cala - October 2015Market trends in IT - exchange cala - October 2015
Market trends in IT - exchange cala - October 2015Eduardo Pelegri-Llopart
 
Introduction to OpenStack Storage
Introduction to OpenStack StorageIntroduction to OpenStack Storage
Introduction to OpenStack StorageNetApp
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightCloudera, Inc.
 
Exploring and Using the Python Ecosystem
Exploring and Using the Python EcosystemExploring and Using the Python Ecosystem
Exploring and Using the Python EcosystemAdam Cook
 

My Data Journey with Python (SciPy 2015 Keynote)

  • 1. My Data Journey with Python
      Wes McKinney @wesmckinn
      SciPy 2015 Keynote, 2015-07-09
      © Cloudera, Inc. All rights reserved.
  • 2. Who am I?
  • 3. This talk
      • 2007-present, from my perspective
      • Celebrating our successes
      • Challenges and opportunities for the future
  • 4. Why are we all here?
  • 5. My pre-2007 existence
      • I was a mathematician!
      • No exposure to Python, SQL, R (or any analytics for that matter)
      • Rude awakening ahead
  • 6. My first job: AQR (quant hedge fund)
      • A quant finance operation that lived and breathed SQL and Excel
      • Production systems in C++, Java, Visual BASIC, and C# .NET
      • Some PhD-level researchers used MATLAB for research (as was common in finance / economics departments)
  • 7. Productivity frustrations
      • First year: several analytics and statistical data analysis projects
          • A huge amount of SQL
          • Some Java
          • A little bit of R
          • … and TONS of Excel
      • Projects felt like 5% conceptualization, 95% tedium
  • 8. Python in early 2008: different times
      • A bleeding edge stack
          • NumPy 1.0.4
          • SciPy 0.6.0
          • matplotlib 0.91.2
          • IPython 0.8.4, SVN history begins 2/2008
          • Cython 0.9.8
      • The scientific Python community seemed mainly focused on attracting MATLAB, HPC, and scientific lab users
  • 9. 2008: Things SciPythonistas didn’t care too much about
      • Relational data or SQL
      • Missing data handling (outside numpy.ma)
      • Statistics and econometrics (first statsmodels release: 2011)
      • Statistical graphics
      • Machine learning (scikit-learn 0.1: 2/2010)
      • Analytics and business intelligence
  • 10. Taking a gamble
      • Decided to give Python a shot for AQR projects after seeing part of MASS R package ported in scipy.stats.models by Jonathan Taylor at Stanford
      • proto-pandas first version built in April 2008
          • Focused on porting an R project to Python
      • May ’08: Embedded Python interpreter in a legacy C++ system
      • 5/2008 – 12/2008: Skunkworks Python ports and evangelism across company
  • 11. Why did Python work out?
      • Batteries included
      • Interoperability with C++
          • Embedding Python interpreter
          • Wrapping C++ in Python C extensions
      • Productive user interface
          • Python language
          • IPython + matplotlib
  • 12. (slide has no text)
  • 13. Some other cool things we built
      • A global macro risk modeling system (using pandas + NumPy + PyTables)
      • A heterogeneous market data loading and cleaning system
      • A task-based cluster computing system (similar to Celery)
      • Tick data storage and analytics
      • Various GUIs with wxPython + matplotlib
  • 14. End 2009: pandas!
      • AQR lets me open source pandas 0.1 on Christmas, 2009.

      ~/Downloads/pandas-0.1 $ cloc --exclude-ext pandas
      -------------------------------------------------------------------------------
      Language                     files          blank        comment           code
      -------------------------------------------------------------------------------
      Python                          41           3124           2933           8225
      Cython                           7            418             93           1247
      C/C++ Header                     1              0              0              1
      -------------------------------------------------------------------------------
      SUM:                            49           3542           3026           9473
      -------------------------------------------------------------------------------
  • 15. 2010 – 2011: Python’s data growing pains
      • pandas did not evolve much after its initial release
      • No consensus or momentum behind any project for analytics / data wrangling
      • AQR -> Duke Statistical Science
      • AQR sponsors bug fixes and new features in pandas
  • 16. May 2011: Getting inspired
      • 2011-05-13: Enthought Datarray Summit
          • Discuss how to enable Python to become more useful for statistical computing
          • Me: “Library fragmentation is destructive; integration is better”
          • Data structures, missing data, and data wrangling tools
      • 2011-05-23 – 2011-06-03: Python finance consulting engagement
          • Realized that Python data tools were sorely needed in industry
          • But not nearly mature enough yet
  • 17. 2011-05-30
  • 18. Tell me about your use cases
  • 19. Making pandas a better tool
      • Consulting at AppNexus (NYC ad tech company) opened eyes to new problems
      • June 2011 – December 2012
          • Fix some pandas design issues
          • Build out data wrangling capabilities (hierarchical indexes, etc.)
          • Create “killer apps” (time series capabilities)
          • Evangelize and collaborate with other projects
  • 20. Taking advantage of temporary financial freedom
  • 21. Making a book happen
      • A chicken-and-egg problem
      • Fernando Pérez, Brian Granger, and John Hunter had been toying with the idea of a “SciPy Book” for a couple years
      • Decided to forge my own path in Nov 2011
          • Writing took about 9 months
          • Helped motivate me to “finish” parts of pandas
      • ~50,000 copies in circulation
  • 22. Clarity and software engineering
      • Progress in software not just about hard work
      • Solving the right problems
          • … in the right order
          • … while wasting little time/energy on non-impactful issues
          • … while being faced with real world concerns (80/20 rule)
      • Taking the time to develop a clear vision and scope for a project is a major factor in its success or failure
  • 23. It took a village
      • Fernando Pérez & Brian Granger (IPython)
      • Skipper Seabold & Josef Perktold (statsmodels)
      • Eric Jones (Enthought)
      • Travis Oliphant & Peter Wang (Enthought & Continuum)
      • John Hunter (matplotlib)
      • … and many others
  • 24. An unlikely train ride
      SEA -> PDX
      November 18, 2011
  • 25. Seatmate: “Are you a programmer?” (he saw my Emacs buffers)
  • 26. Wes: “Yeah, I do Python”
  • 27. Seatmate: “Oh, I do a bit of Python too”
  • 28. Wes: “Cool, well, there’s this awesome new thing called the IPython notebook”
  • 29. My seatmate was computational bio professor and 5-year PSF member Titus Brown
  • 30. And he would later assist the IPython team in their Sloan Foundation $1mm grant in 2012
  • 31. Some words about John Hunter (1968 – 2012)
  • 32. Business ventures 2012 – 2014
      • 2012: Lambda Foundry
          • Support and develop pandas
          • Explored creating a commercial Python financial toolkit
      • 2013 – 2014: DataPad
          • “Google Drive for Analytics / BI”
          • With Chang She (MIT -> AQR -> pandas)
          • Silicon Valley VC-backed
          • Acquired by Cloudera in September 2014
  • 33. Cloudera
      • Sort of “the Red Hat of Big Data”
      • The leading open source Hadoop platform
      • Supporting and developing a little over 20 Apache-licensed open source projects
      • A dream job
          • Full time open source development
          • Solving hard data problems faced by the world’s largest companies
      • P.S. we’re hiring engineers in Austin + Bay Area
  • 34. What I’m interested in right now
      • Ways to enable collaboration on data tools across programming languages
      • Domain specific language design and compilation
      • Improving the Python-on-Hadoop experience
      • LLVM + Code generation
  • 35. Different kinds of Big Data
      • Python programmers have been dealing with big scientific data in HPC settings for years
      • Big…
          • Text data
          • Homogeneous array data
          • Tabular (structured) data
          • JSON-like (semi-structured) data
  • 36. The Great Data Tool Decoupling™
      • Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
      • In fact, you should really want this to happen
          • Share systems among languages
          • Reduce fragmentation and “lock-in”
          • Shift developer focus to usability
      • Prediction: we’ll be there by 2025; sooner if we all get our act together
  • 37. Thank you
      @wesmckinn
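
The decoupling thesis on slide 36 can be illustrated with a toy sketch. This is not code from the talk: the names `Storage`, `Engine`, `InMemoryStorage`, `LoopEngine`, `BuiltinSumEngine`, and `column_total` are all hypothetical, and the example uses only the Python standard library. It shows one user-facing function that works with any storage layer and any execution engine, so either piece could be swapped without touching the other two.

```python
from typing import Dict, Iterable, Iterator, List, Protocol


class Storage(Protocol):
    """Hypothetical storage interface: anything that can yield rows."""

    def scan(self) -> Iterable[Dict[str, float]]: ...


class Engine(Protocol):
    """Hypothetical execution interface: anything that can total a column."""

    def total(self, rows: Iterable[Dict[str, float]], column: str) -> float: ...


class InMemoryStorage:
    """One possible storage layer: rows held in a Python list."""

    def __init__(self, rows: List[Dict[str, float]]) -> None:
        self._rows = list(rows)

    def scan(self) -> Iterator[Dict[str, float]]:
        return iter(self._rows)


class LoopEngine:
    """One possible engine: an explicit accumulation loop."""

    def total(self, rows: Iterable[Dict[str, float]], column: str) -> float:
        result = 0.0
        for row in rows:
            result += row[column]
        return result


class BuiltinSumEngine:
    """A second engine with the same interface, built on sum()."""

    def total(self, rows: Iterable[Dict[str, float]], column: str) -> float:
        return float(sum(row[column] for row in rows))


def column_total(storage: Storage, engine: Engine, column: str) -> float:
    """User-facing call: it knows neither how rows are stored nor how they are summed."""
    return engine.total(storage.scan(), column)


trades = InMemoryStorage([{"qty": 10}, {"qty": 5}, {"qty": 7}])
print(column_total(trades, LoopEngine(), "qty"))
print(column_total(trades, BuiltinSumEngine(), "qty"))
```

Both calls return the same total because the two engines implement the same small interface. In this sketch that interface is a Python `Protocol`; sharing systems across languages, as the slide proposes, would need a language-neutral contract instead.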