PyData: The Next Generation

Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15

© Cloudera, Inc. All rights reserved.
  
  
PyData: Everything’s awesome… or is it?
  
  
Me

•  Data systems, tools, Python guru at Cloudera
•  Formerly Founder/CEO of DataPad (visual analytics startup)
•  Created pandas in 2008, lead developer until 2013
•  Python for Data Analysis, published 10/2012
    • O’Reilly’s best-selling data book of 2014
•  Pythonista since 2007
  
  
What’s this about?

•  Hopes and fears for the community and ecosystem
•  Why do I care?
    • Python is fun!
    • Leverage
    • Accessibility for newbies
    • Community: smart, nice, humble people
  
  
Python at Cloudera

•  Want Cloudera platform users to be successful with Python
•  Spark/PySpark part of the Enterprise Data Hub / CDH
•  Actively investing in Python tooling
    • (p.s. we’re hiring?)
    • (p.p.s. we have an Austin office now!)
  
  
Historical perspective and background

•  20 years of fast numerical computing in Python (Numeric 1995)
•  10 years of NumPy
•  PyData becomes a thing in 2012
•  Python as a data language goes mainstream
    • Job descriptions tell all
    • Shift in larger Python community from web towards data
•  PyCon 2015 committee reported substantial growth in data-related submissions!
  
  
How’d this happen?

•  Data, data everywhere
•  Science! scikit-learn, statsmodels, and friends
•  Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
•  IPython Notebook
•  Learning resources (books, conferences, blogs, etc.)
•  Python environment/library management that “just works”
  
  
Put a Python (interface) on it!

Something no one got fired for, ever.
  	
  
  
Meanwhile…

•  Hadoop and Big Data go mainstream in 2009 onward
    • First Hadoop World: Fall 2009
    • First Strata conference: Spring 2011
•  Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
•  Solutions built, frameworks developed, companies founded
•  Python was generally not a central part of those solutions
    • A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
  
  
We’re lucky to have lots of nice things

•  What a language!
•  IPython: interactive computing and collaboration
•  Libraries to solve nearly any (non-big data) problem
•  Trustworthy (medium) data wrangling, statistics, machine learning
•  HPC / GPU / parallel computing frameworks
•  FFI tools
•  … and much more
  
	
  
  
	
  
“If this isn’t nice, what is?”
- Kurt Vonnegut
  
  
So, what kind of big data?

•  Big multidimensional arrays / linear algebra
•  Big tables (structured data)
•  Big text data (unstructured data)
•  Empirically I personally am mostly interested in big tables
  
  
What kind of big data problems?

•  ETL / Data Wrangling
    • Python has been used here for years with Hadoop Streaming
•  BI / Analytics (“things you can do in SQL”)
•  Advanced Analytics / Machine Learning
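
For readers who have not used Hadoop Streaming: it runs arbitrary executables as the mapper and reducer, passing records on stdin and reading tab-separated key/value lines back from stdout. A minimal word-count sketch of that contract follows. The function names are illustrative, and the in-process `sorted` call merely stands in for Hadoop's shuffle/sort phase; none of this comes from the talk itself.

```python
import itertools


def mapper(lines):
    """Emit one 'word<TAB>1' line per token, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce inside one process."""
    return list(reducer(sorted(mapper(lines))))


output = run_local(["big data big tables", "Big text data"])
for line in output:
    print(line)
```

In a real job the mapper and reducer would be two separate scripts reading `sys.stdin`, with the framework doing the sort between them.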
  
  
Some ways we are #winning

•  Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
•  Huge uptake in the financial sector
•  Many current and upcoming generations of data scientists learning Python as a first language
•  Python in HPC / scientific computing
  
  
Some ways we are not #winning

•  Python still doesn’t have a great “big data story”
•  Little venture capital trickling down to Python projects
•  Data structures and programming APIs lagging modern realities
•  Weak support for emerging data formats
•  Many companies with Python big data successes have not open-sourced their work
  
  
Python in big data workflows in practice

[Diagram: two regions. “Big Data, Many machines” (more counting / ETL): HDFS, Hadoop-MR, Spark, SQL. “Small/Medium Data, One Machine” (more insights / reporting): pandas, Viz tools, ML / Stats. Also labeled: DSLs.]
  
  
Big data storage formats

•  JSON and CSV are not a good way to warehouse data
•  Apache Avro
    • Compact binary data serialization format
    • RPC framework
•  Apache Parquet
    • Efficient columnar data format optimized for HDFS
    • Supports nested and repeated fields, compression, encoding schemes
    • Co-developed by Twitter and Cloudera
    • Reference impl’s in Impala (C++), and standalone Java/Scala (used in Spark)
  
We’re living in a JVM world

•  Scala rapidly taking over big data analytics
    • Functional, concise, good for building high level DSLs
    • Build nice Scala APIs to clunkier Java frameworks
•  JVM legitimately good for concurrent, distributed systems
•  Binary interface with Python a major issue
  
  
Dremel, baby, Dremel…

•  VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
•  Inspiration for Parquet (cf blog “Dremel made easy with Parquet”)
•  Peta-scale analytics directly on nested data
•  Google BigQuery said to be an IaaS-ification of Dremel
    • Supports SQL variant + new user-defined functions with JavaScript + V8
  
SELECT COUNT(c1 > c2)
FROM (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
SUM(a.b.p.q.r) WITHIN RECORD AS c2
FROM T3)
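
The query above aggregates repeated nested fields inside each record before comparing the two totals. The sketch below mimics that on plain Python dicts and lists. The sample records and the `leaf_values` helper are hypothetical, and reading the outer `COUNT` as "number of records where c1 > c2" is an interpretation made for this example, not a statement of Dremel's exact semantics.

```python
def leaf_values(node, path):
    """Yield every value along a dotted path, fanning out over repeated (list) fields."""
    if isinstance(node, list):
        for item in node:
            yield from leaf_values(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from leaf_values(node[path[0]], path[1:])


def sum_within_record(record, dotted_path):
    """Rough analogue of SUM(path) WITHIN RECORD: one total per top-level record."""
    return sum(leaf_values(record, dotted_path.split(".")))


# Two made-up nested records standing in for table T3.
records = [
    {"a": {"b": [
        {"c": {"d": [1, 2]}, "p": {"q": [{"r": 1}]}},
        {"c": {"d": [4]}, "p": {"q": [{"r": 2}, {"r": 3}]}},
    ]}},
    {"a": {"b": [
        {"c": {"d": [1]}, "p": {"q": [{"r": 5}]}},
    ]}},
]

totals = [
    (sum_within_record(rec, "a.b.c.d"), sum_within_record(rec, "a.b.p.q.r"))
    for rec in records
]
matches = sum(1 for c1, c2 in totals if c1 > c2)
print(totals)   # [(7, 6), (1, 5)]
print(matches)  # 1
```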
  
Cloudera Impala

•  Open-source interactive SQL for Hadoop
•  Analytical query processor written in C++ with LLVM code generation
•  Optimized to scan tables (best as Parquet format) in HDFS
•  SQL front-end and query optimizer / planner
•  User-defined function API (C++)
    • impyla enables Python UDFs to be compiled with Numba to LLVM IR
  
  
Cloudera Impala (cont’d)

•  For high performance big data analytics, Impala could be Python’s best friend
•  C++/LLVM backend is lower-level than SQL
•  Nested data support is coming
  
  
Some interesting things in recent times
  
  
Set point: Hadley Wickham

•  R has upped its game with dplyr, tidyr, and other new projects
•  New standard for a uniform interface to either in-memory or in-database data processing
•  Composable table primitive operations
•  Multiple major versions shipped, getting adopted
  
	
  
80dc69b 2012-10-28 | Initial commit of dplyr [hadley]
tbl %>% filter(c == 'bar') %>% group_by(a, b) %>%
    summarise(metric = mean(d - f)) %>%
    arrange(desc(metric))
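
For Python readers, the same pipeline can be mimicked with a few small, composable functions over lists of dicts. This is only a sketch of the "composable table primitive operations" idea in standard-library Python. The function names, the `pipe` helper, and the sample table are made up for this example and are not the API of dplyr, pandas, or any other library.

```python
from collections import defaultdict
from functools import reduce
from statistics import mean


def pipe(table, *steps):
    """Thread a table through a sequence of table -> table functions, like %>%."""
    return reduce(lambda current, step: step(current), steps, table)


def filter_rows(predicate):
    return lambda rows: [row for row in rows if predicate(row)]


def group_summarise(keys, **aggregations):
    """Group by `keys`, then compute each named aggregation over the group's rows."""
    def step(rows):
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[k] for k in keys)].append(row)
        return [
            {**dict(zip(keys, key)),
             **{name: fn(members) for name, fn in aggregations.items()}}
            for key, members in groups.items()
        ]
    return step


def arrange_desc(column):
    return lambda rows: sorted(rows, key=lambda row: row[column], reverse=True)


# Hypothetical input table with the columns the R pipeline refers to.
tbl = [
    {"a": 1, "b": "x", "c": "bar", "d": 10.0, "f": 4.0},
    {"a": 1, "b": "x", "c": "bar", "d": 8.0, "f": 2.0},
    {"a": 2, "b": "y", "c": "bar", "d": 20.0, "f": 5.0},
    {"a": 2, "b": "y", "c": "foo", "d": 99.0, "f": 0.0},
]

result = pipe(
    tbl,
    filter_rows(lambda r: r["c"] == "bar"),
    group_summarise(("a", "b"), metric=lambda rows: mean(r["d"] - r["f"] for r in rows)),
    arrange_desc("metric"),
)
print(result)
```

Because every step is a function from table to table, steps can be reordered, reused, or swapped for an implementation that targets a database instead of in-memory lists.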
  
	
  	
  	
  	
  	
  
  
Blaze

•  Shares some semantics with dplyr
•  Uses a generalized datashape protocol
•  Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
    • Deferred expression API
    • Support for piping data between storage systems
    • Multiple backends (pandas, SQL, MongoDB, PySpark, …)
    • Growing support for out-of-core analytics
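
The deferred-expression idea can be illustrated without Blaze itself. The sketch below is a toy and not Blaze's API: the `Column`, `Literal`, and `BinaryOp` classes and both "backends" are invented here. An expression is built once as a tree and then handed either to an in-memory evaluator or to a SQL string generator.

```python
import operator
from dataclasses import dataclass

_OPS = {">": operator.gt, "<": operator.lt, "+": operator.add, "*": operator.mul}


class Expr:
    """Base class: operators build a tree instead of computing a value."""

    def __gt__(self, other):
        return BinaryOp(">", self, wrap(other))

    def __lt__(self, other):
        return BinaryOp("<", self, wrap(other))

    def __add__(self, other):
        return BinaryOp("+", self, wrap(other))

    def __mul__(self, other):
        return BinaryOp("*", self, wrap(other))


@dataclass(eq=False)
class Column(Expr):
    name: str


@dataclass(eq=False)
class Literal(Expr):
    value: object


@dataclass(eq=False)
class BinaryOp(Expr):
    op: str
    left: Expr
    right: Expr


def wrap(value):
    """Promote plain Python values to Literal nodes."""
    return value if isinstance(value, Expr) else Literal(value)


def evaluate(expr, row):
    """In-memory backend: compute the expression against one row (a dict)."""
    if isinstance(expr, Column):
        return row[expr.name]
    if isinstance(expr, Literal):
        return expr.value
    return _OPS[expr.op](evaluate(expr.left, row), evaluate(expr.right, row))


def to_sql(expr):
    """SQL backend: render the same tree as a SQL fragment (numeric literals only)."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Literal):
        return repr(expr.value)
    return f"({to_sql(expr.left)} {expr.op} {to_sql(expr.right)})"


expr = (Column("price") * Column("qty")) > 100  # nothing is computed yet
rows = [{"price": 30, "qty": 2}, {"price": 60, "qty": 2}]
print([evaluate(expr, r) for r in rows])  # [False, True]
print(to_sql(expr))                       # ((price * qty) > 100)
```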
  
  
libdynd

•  Led by Mark Wiebe at Continuum Analytics
•  Pure C++11 modern reimagining of NumPy
•  Python bindings
•  Supports variadic data cells and nested types (datashape protocol)
•  Development has focused on the data container design over analytics
  
  
PySpark

•  Popularity may exceed official Scala API
•  Spark was not exactly designed to be an ideal companion to Python
•  General architecture
    • Users build Spark deferred expression graphs in Python
    • User-supplied functions are serialized and broadcast around the cluster
    • Spark plans job and breaks work into tasks executed by Python worker jobs
•  Data is managed / shuffled by the Spark Scala master process
•  Python used largely as a black box to transform input to output
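
The general architecture above can be caricatured in a few lines of plain Python. This is a single-process toy and not PySpark: the `LazyDataset` class and its methods are hypothetical, no serialization or cluster is involved, and the "tasks" simply run in a loop. It shows only the shape of the idea: transformations are recorded lazily, then applied partition by partition, with the user's functions treated as black boxes.

```python
class LazyDataset:
    """Records transformations; nothing runs until collect() is called."""

    def __init__(self, partitions, steps=()):
        self._partitions = partitions  # list of lists, one per hypothetical worker
        self._steps = steps            # tuple of ("map" | "filter", function)

    def map(self, fn):
        return LazyDataset(self._partitions, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._partitions, self._steps + (("filter", fn),))

    def _run_task(self, partition):
        """One task: push a single partition through every recorded step."""
        records = iter(partition)
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)

    def collect(self):
        """Plan one task per partition and concatenate the results in order."""
        return [row for part in self._partitions for row in self._run_task(part)]


calls = []


def square(x):
    calls.append(x)  # side effect so we can observe when the function really runs
    return x * x


data = LazyDataset([[1, 2, 3], [4, 5, 6]])
pipeline = data.filter(lambda x: x % 2 == 0).map(square)
print(calls)               # [] because the graph is only a description so far
print(pipeline.collect())  # [4, 16, 36]
```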
  
  
PySpark: Some more gory details

•  Spark master controlled using py4j
    • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
•  Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
•  Does not natively interface with NumPy (yet)
•  But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides
  
# pass large object by py4j is very slow and need much memory
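
To give a rough feel for why the marshalling format matters, the sketch below encodes the same numeric records three ways using only the standard library. The record layout and the `struct` format string are assumptions chosen for the demo. This is not how PySpark or Py4J actually move data, and the byte counts will vary with the data and the Python version.

```python
import json
import pickle
import struct

# Hypothetical records: (user_id, score, active)
records = [(1000 + i, i / 7.0, i % 2 == 0) for i in range(1000)]

as_json = json.dumps(records).encode("utf-8")
as_pickle = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)

# Fixed-width binary rows: little-endian int64, float64, bool = 17 bytes each.
ROW = struct.Struct("<qd?")
as_binary = b"".join(ROW.pack(*rec) for rec in records)

# The packed form round-trips exactly, and any row can be located by offset.
decoded = [
    ROW.unpack_from(as_binary, offset)
    for offset in range(0, len(as_binary), ROW.size)
]
assert decoded == records

print("json bytes:  ", len(as_json))
print("pickle bytes:", len(as_pickle))
print("binary bytes:", len(as_binary))
```

Size is only part of the story; this sketch does not measure encoding or decoding time, and a fixed-width layout like this one requires both sides to agree on the schema in advance.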
  
Spartan

•  http://github.com/spartan-array/spartan
•  Python distributed array expression evaluator (“distributed NumPy”)
•  Developed by Russell Power & others at NYU
•  Uses ZeroMQ and custom RPC implementation
  
  
Things I think we should do

•  Create high fidelity data structures for Dremel-style data
•  Get serious about Avro, Parquet, and other new data format standards
•  Invest in the Python-Impala-LLVM relationship
•  Efficient binary protocols to receive and emit data from Python processes
  
  
Conclusions

•  Python + PyData stack is as strong as ever, and still gaining momentum
•  The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
•  Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding
  
  
Thank you

Wes McKinney @wesmckinn
wes@cloudera.com
  

SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 

PyData: The Next Generation

  • 1. PyData: The Next Generation
      Wes McKinney @wesmckinn
      Data Day Texas 2015 #ddtx15
      © Cloudera, Inc. All rights reserved.
  • 2. PyData: Everything’s awesome…or is it?
      Wes McKinney @wesmckinn
      Data Day Texas 2015 #ddtx15
  • 3. Me
      • Data systems, tools, Python guru at Cloudera
      • Formerly Founder/CEO of DataPad (visual analytics startup)
      • Created pandas in 2008, lead developer until 2013
      • Python for Data Analysis, published 10/2012
          • O’Reilly’s best-selling data book of 2014
      • Pythonista since 2007
  • 4. What’s this about?
      • Hopes and fears for the community and ecosystem
      • Why do I care?
          • Python is fun!
          • Leverage
          • Accessibility for newbies
          • Community: smart, nice, humble people
  • 5. Python at Cloudera
      • Want Cloudera platform users to be successful with Python
      • Spark/PySpark part of the Enterprise Data Hub / CDH
      • Actively investing in Python tooling
          • (p.s. we’re hiring?)
          • (p.p.s. we have an Austin office now!)
  • 6. Historical perspective and background
      • 20 years of fast numerical computing in Python (Numeric 1995)
      • 10 years of NumPy
      • PyData becomes a thing in 2012
      • Python as a data language goes mainstream
          • Job descriptions tell all
          • Shift in larger Python community from web towards data
      • PyCon 2015 committee reported substantial growth in data-related submissions!
  • 7. How’d this happen?
      • Data, data everywhere
      • Science! scikit-learn, statsmodels, and friends
      • Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
      • IPython Notebook
      • Learning resources (books, conferences, blogs, etc.)
      • Python environment/library management that “just works”
  • 8. Put a Python (interface) on it!
      Something no one got fired for, ever.
  • 9. Meanwhile…
      • Hadoop and Big Data go mainstream in 2009 onward
          • First Hadoop World: Fall 2009
          • First Strata conference: Spring 2011
      • Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
      • Solutions built, frameworks developed, companies founded
      • Python was generally not a central part of those solutions
          • A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
  • 10. We’re lucky to have lots of nice things
      • What a language!
      • IPython: interactive computing and collaboration
      • Libraries to solve nearly any (non-big data) problem
      • Trustworthy (medium) data wrangling, statistics, machine learning
      • HPC / GPU / parallel computing frameworks
      • FFI tools
      • … and much more
  • 11. “If this isn’t nice, what is?” —Kurt Vonnegut
  • 12. So, what kind of big data?
      • Big multidimensional arrays / linear algebra
      • Big tables (structured data)
      • Big text data (unstructured data)
      • Empirically I personally am mostly interested in big tables
  • 13. What kind of big data problems?
      • ETL / Data Wrangling
          • Python has been used here for years with Hadoop Streaming
      • BI / Analytics (“things you can do in SQL”)
      • Advanced Analytics / Machine Learning
  • 14. Some ways we are #winning
      • Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
      • Huge uptake in the financial sector
      • Many current and upcoming generations of data scientists learning Python as a first language
      • Python in HPC / scientific computing
  • 15. Some ways we are not #winning
      • Python still doesn’t have a great “big data story”
      • Little venture capital trickling down to Python projects
      • Data structures and programming APIs lagging modern realities
      • Weak support for emerging data formats
      • Many companies with Python big data successes have not open-sourced their work
  • 16. Python in big data workflows in practice
      [Diagram labels: HDFS, Hadoop-MR, Spark, SQL; Big Data, Many machines; Small/Medium Data, One Machine; pandas, Viz tools, ML / Stats; More counting / ETL; More insights / reporting; DSLs]
  • 17. Big data storage formats
      • JSON and CSV are not a good way to warehouse data
      • Apache Avro
          • Compact binary data serialization format
          • RPC framework
      • Apache Parquet
          • Efficient columnar data format optimized for HDFS
          • Supports nested and repeated fields, compression, encoding schemes
          • Co-developed by Twitter and Cloudera
          • Reference impl’s in Impala (C++), and standalone Java/Scala (used in Spark)
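To make the storage-format point concrete, here is a minimal sketch contrasting a row-oriented text encoding (JSON lines) with a column-oriented fixed-width binary layout. It is not Parquet's or Avro's actual on-disk format (both add schemas, encodings, and compression); the table, the column names `user_id`, `score`, `clicks`, and the helper functions are all hypothetical.

```python
import json
import struct

# Hypothetical table; the column names are made up for illustration.
n = 1000
rows = [{"user_id": i, "score": i * 0.5, "clicks": i % 7} for i in range(n)]

# Row-oriented text: one JSON object per record, field names repeated every time.
json_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# Column-oriented binary: each column stored contiguously with a fixed-width type.
columns = {
    "user_id": struct.pack(f"<{n}q", *(r["user_id"] for r in rows)),  # int64
    "score": struct.pack(f"<{n}d", *(r["score"] for r in rows)),      # float64
    "clicks": struct.pack(f"<{n}b", *(r["clicks"] for r in rows)),    # int8
}
columnar_size = sum(len(buf) for buf in columns.values())


def total_clicks_json(data):
    # Reading one column still means parsing every field of every record.
    return sum(json.loads(line)["clicks"] for line in data.decode("utf-8").splitlines())


def total_clicks_columnar(cols):
    # Only the bytes of the requested column are touched.
    return sum(struct.unpack(f"<{n}b", cols["clicks"]))


print(len(json_bytes), columnar_size)
print(total_clicks_json(json_bytes) == total_clicks_columnar(columns))
```

The columnar buffers here total 17,000 bytes (8 + 8 + 1 bytes per row), and a scan over `clicks` reads only 1,000 of them, which is the property that makes columnar formats attractive for analytic scans.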
  • 18. We’re living in a JVM world
      • Scala rapidly taking over big data analytics
          • Functional, concise, good for building high level DSLs
          • Build nice Scala APIs to clunkier Java frameworks
      • JVM legitimately good for concurrent, distributed systems
      • Binary interface with Python a major issue
  • 19. Dremel, baby, Dremel…
      • VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
      • Inspiration for Parquet (cf blog “Dremel made easy with Parquet”)
      • Peta-scale analytics directly on nested data
      • Google BigQuery said to be an IaaS-ification of Dremel
          • Supports SQL variant + new user-defined functions with JavaScript + V8
      SELECT COUNT(c1 > c2) FROM
        (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
                SUM(a.b.p.q.r) WITHIN RECORD AS c2
         FROM T3)
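The `WITHIN RECORD` query above aggregates over repeated fields inside each nested record before the outer query runs. A rough pure-Python sketch of that idea follows. The sample table `T3`, the helper names, and the reading of `COUNT(c1 > c2)` as "number of records where c1 exceeds c2" are assumptions made for illustration, not a statement of Dremel's or BigQuery's exact semantics.

```python
def leaf_values(node, path):
    """Yield every value reachable along `path`, fanning out over repeated (list) fields."""
    if isinstance(node, list):
        for item in node:
            yield from leaf_values(item, path)
    elif not path:
        if node is not None:
            yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from leaf_values(node[path[0]], path[1:])


def within_record_sum(record, dotted_path):
    # Note: an empty field sums to 0 here, whereas SQL SUM over no rows is NULL.
    return sum(leaf_values(record, dotted_path.split(".")))


# Hypothetical nested records with repeated fields at several levels.
T3 = [
    {"a": {"b": [{"c": [{"d": 5}, {"d": 7}], "p": {"q": [{"r": 3}]}}]}},
    {"a": {"b": [{"c": [{"d": 1}], "p": {"q": [{"r": 4}, {"r": 6}]}}]}},
    {"a": {"b": [{"c": [{"d": 2}]}, {"c": [{"d": 9}], "p": {"q": [{"r": 8}]}}]}},
]

# Inner query: one (c1, c2) pair per record.
inner = [(within_record_sum(r, "a.b.c.d"), within_record_sum(r, "a.b.p.q.r")) for r in T3]

# Outer query, under the assumed reading of COUNT(c1 > c2).
count = sum(1 for c1, c2 in inner if c1 > c2)
print(inner, count)
```

The point of the sketch is that the per-record sums never require flattening the nested data into a joined table first, which is what makes this style of query cheap on columnar nested storage.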
  • 20. Cloudera Impala
      • Open-source interactive SQL for Hadoop
      • Analytical query processor written in C++ with LLVM code generation
      • Optimized to scan tables (best as Parquet format) in HDFS
      • SQL front-end and query optimizer / planner
      • User-defined function API (C++)
          • impyla enables Python UDFs to be compiled with Numba to LLVM IR
  • 21. Cloudera Impala (cont’d)
      • For high performance big data analytics, Impala could be Python’s best friend
      • C++/LLVM backend is lower-level than SQL
      • Nested data support is coming
  • 22. Some interesting things in recent times
  • 23. Set point: Hadley Wickham
      • R has upped its game with dplyr, tidyr, and other new projects
      • New standard for a uniform interface to either in-memory or in-database data processing
      • Composable table primitive operations
      • Multiple major versions shipped, getting adopted
      80dc69b 2012-10-28 | Initial commit of dplyr [hadley]
      tbl %>% filter(c == 'bar') %>% group_by(a, b)
          %>% summarise(metric = mean(d - f))
          %>% arrange(desc(metric))
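For readers who do not use R, the dplyr pipeline on this slide can be approximated with a few composable table primitives in plain Python. This is only a sketch: the helper names (`filter_rows`, `group_by`, `summarise`, `arrange_desc`) and the sample rows are hypothetical, and they are not the API of dplyr, pandas, or any other library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample table with the columns used on the slide.
tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 2.0},
    {"a": "y", "b": 2, "c": "bar", "d": 3.0, "f": 1.0},
    {"a": "y", "b": 2, "c": "foo", "d": 100.0, "f": 0.0},
]


def filter_rows(rows, predicate):
    return [r for r in rows if predicate(r)]


def group_by(rows, *keys):
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append(r)
    return keys, dict(groups)


def summarise(grouped, **aggregations):
    keys, groups = grouped
    out = []
    for key, members in groups.items():
        row = dict(zip(keys, key))
        for name, fn in aggregations.items():
            row[name] = fn(members)
        out.append(row)
    return out


def arrange_desc(rows, column):
    return sorted(rows, key=lambda r: r[column], reverse=True)


# Same steps as the slide: filter, group, summarise, arrange.
result = arrange_desc(
    summarise(
        group_by(filter_rows(tbl, lambda r: r["c"] == "bar"), "a", "b"),
        metric=lambda rows: mean(r["d"] - r["f"] for r in rows),
    ),
    "metric",
)
print(result)
```

The nested calls read inside-out, which is exactly the readability problem that the `%>%` pipe operator avoids in the R version.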
  • 24. Blaze
      • Shares some semantics with dplyr
      • Uses a generalized datashape protocol
      • Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
          • Deferred expression API
          • Support for piping data between storage systems
          • Multiple backends (pandas, SQL, MongoDB, PySpark, …)
          • Growing support for out-of-core analytics
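The "deferred expression" and "multiple backends" ideas can be illustrated with a toy expression tree: operators build a description of the computation, and separate functions either render it as SQL text or evaluate it in memory. This is a minimal sketch and not Blaze's actual API; every class and function name below (`Expr`, `Col`, `Lit`, `BinOp`, `to_sql`, `evaluate`) is hypothetical.

```python
import operator


class Expr:
    # Operators build a tree instead of computing a value.
    def __sub__(self, other):
        return BinOp("-", self, wrap(other))

    def __gt__(self, other):
        return BinOp(">", self, wrap(other))


class Col(Expr):
    def __init__(self, name):
        self.name = name


class Lit(Expr):
    def __init__(self, value):
        self.value = value


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


def wrap(value):
    return value if isinstance(value, Expr) else Lit(value)


def to_sql(expr):
    """Backend 1: render the expression as a SQL fragment."""
    if isinstance(expr, Col):
        return expr.name
    if isinstance(expr, Lit):
        return repr(expr.value)
    return f"({to_sql(expr.left)} {expr.op} {to_sql(expr.right)})"


_OPS = {"-": operator.sub, ">": operator.gt}


def evaluate(expr, row):
    """Backend 2: evaluate the same expression against an in-memory row."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    return _OPS[expr.op](evaluate(expr.left, row), evaluate(expr.right, row))


expr = (Col("d") - Col("f")) > 5   # nothing is computed yet
print(to_sql(expr))
print(evaluate(expr, {"d": 10, "f": 4}))
```

Because the expression is data rather than an eager result, the same user code can be handed to whichever backend holds the data.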
  • 25. libdynd
      • Led by Mark Wiebe at Continuum Analytics
      • Pure C++11 modern reimagining of NumPy
      • Python bindings
      • Supports variadic data cells and nested types (datashape protocol)
      • Development has focused on the data container design over analytics
  • 26. PySpark
      • Popularity may exceed official Scala API
      • Spark was not exactly designed to be an ideal companion to Python
      • General architecture
          • Users build Spark deferred expression graphs in Python
          • User-supplied functions are serialized and broadcast around the cluster
          • Spark plans job and breaks work into tasks executed by Python worker jobs
      • Data is managed / shuffled by the Spark Scala master process
      • Python used largely as a black box to transform input to output
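The architecture on this slide (serialize the user's function, ship it to Python workers, treat each worker as a black box over its partition) can be mimicked on one machine with the standard library. This is a loose sketch, not how PySpark is implemented: it uses `marshal` and `pickle` over a subprocess pipe as stand-ins, runs tasks one after another, and the names `WORKER_SOURCE`, `run_task`, and `map_partitions` are hypothetical.

```python
import marshal
import pickle
import subprocess
import sys

# Code run by each worker process: read (function bytes, partition) from stdin,
# rebuild the function, apply it, and write the pickled results to stdout.
WORKER_SOURCE = r"""
import marshal, pickle, sys, types
code_bytes, partition = pickle.load(sys.stdin.buffer)
func = types.FunctionType(marshal.loads(code_bytes), globals())
pickle.dump([func(x) for x in partition], sys.stdout.buffer)
"""


def run_task(func, partition):
    # Serialize the function's bytecode together with the data for one task.
    # This only works for functions without closures or external dependencies.
    payload = pickle.dumps((marshal.dumps(func.__code__), partition))
    done = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=payload,
        stdout=subprocess.PIPE,
        check=True,
    )
    return pickle.loads(done.stdout)


def map_partitions(func, partitions):
    # A real scheduler would run these tasks in parallel across machines.
    return [run_task(func, p) for p in partitions]


partitions = [[1, 2, 3], [4, 5], [6]]
squared = map_partitions(lambda x: x * x, partitions)
print(squared)
```

Every value crosses a process boundary twice as serialized bytes, which hints at why the serialization format matters so much for this design.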
  • 27. PySpark: Some more gory details
      • Spark master controlled using py4j
          • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
      • Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
      • Does not natively interface with NumPy (yet)
      • But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides
      # pass large object by py4j is very slow and need much memory
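As a small illustration of the marshalling concern, the sketch below moves the same numeric column two ways: as a pickled list of Python float objects, and as a contiguous fixed-width buffer that can be viewed without creating one object per value. It uses only the standard library and does not reproduce PySpark's actual serializers; the variable names are made up for the example.

```python
import pickle
from array import array

values = [float(i) for i in range(100_000)]

# Object-by-object serialization: every float is encoded and later rebuilt
# as a separate Python object.
pickled = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
restored_from_pickle = pickle.loads(pickled)

# Fixed-width binary buffer: 8 bytes per value, no per-value framing.
packed = array("d", values).tobytes()
restored_from_buffer = array("d")
restored_from_buffer.frombytes(packed)

# The buffer can also be read in place, without copying or unpacking it.
view = memoryview(packed).cast("d")

print(len(pickled), len(packed), view[3])
```

A consumer such as NumPy could wrap a buffer like `packed` directly, which is the kind of native interface the slide notes was missing at the time.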
  • 28. Spartan
      • http://github.com/spartan-array/spartan
      • Python distributed array expression evaluator (“distributed NumPy”)
      • Developed by Russell Power & others at NYU
      • Uses ZeroMQ and custom RPC implementation
  • 29. Things I think we should do
      • Create high fidelity data structures for Dremel-style data
      • Get serious about Avro, Parquet, and other new data format standards
      • Invest in the Python-Impala-LLVM relationship
      • Efficient binary protocols to receive and emit data from Python processes
  • 30. Conclusions
      • Python + PyData stack is as strong as ever, and still gaining momentum
      • The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
      • Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding
  • 31. Thank you
      Wes McKinney @wesmckinn
      wes@cloudera.com