SlideShare a Scribd company logo
1 of 34
Download to read offline
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
DataFrames:	
  The	
  Extended	
  Cut	
  
Wes	
  McKinney	
  @wesmckinn	
  
Open	
  Data	
  Science	
  Conference,	
  2015-­‐05-­‐31	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
This	
  talk	
  
•  Some	
  commentary	
  /	
  notes	
  on	
  all	
  the	
  data	
  frame	
  interfaces	
  out	
  there	
  
•  Community	
  /	
  collaboraQon	
  challenges	
  
•  Opinions	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Disclaimer:	
  This	
  is	
  a	
  nuanced	
  
discussion	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Who	
  am	
  I?	
  
•  Originator	
  of	
  pandas	
  (2008	
  -­‐	
  )	
  	
  
•  Financial	
  analyQcs	
  in	
  R	
  /	
  Python	
  starQng	
  2007	
  
•  2010-­‐2012	
  
• Hiatus	
  from	
  gainful	
  employment	
  
• Make	
  pandas	
  ready	
  for	
  primeQme	
  
• Write	
  "Python	
  for	
  Data	
  Analysis”	
  
•  2013-­‐2014:	
  DataPad	
  with	
  Chang	
  She	
  &	
  co	
  
•  2014	
  -­‐	
  :	
  Open	
  source	
  development	
  @	
  Cloudera	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  frame,	
  in	
  brief	
  
•  A	
  table-­‐like	
  data	
  structure	
  
•  An	
  API	
  /	
  user	
  interface	
  for	
  the	
  table	
  
• SelecQng	
  data	
  
• Math	
  and	
  relaQonal	
  algebra	
  (join,	
  filter,	
  etc.)	
  
• File	
  /	
  database	
  IO	
  
• ad	
  infinitum	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Some	
  axes	
  of	
  comparison	
  
•  Data	
  structure	
  internals	
  (types,	
  in-­‐memory	
  representaQon,	
  etc.)	
  
•  Basic	
  table	
  API	
  
•  RelaQonal	
  algebra	
  support	
  
•  Group-­‐by	
  /	
  split-­‐apply-­‐combine	
  API	
  
•  Performance,	
  memory	
  use,	
  evaluaQon	
  semanQcs	
  
•  Missing	
  data	
  
•  Data	
  Qdying	
  /	
  ETL	
  tools	
  
•  IO	
  uQliQes	
  
•  Domain	
  specific	
  tools	
  (e.g.	
  Qme	
  series)	
  
•  …	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
The	
  Great	
  Data	
  Tool	
  Decoupling™	
  
•  Thesis:	
  over	
  Qme,	
  user	
  interfaces,	
  data	
  storage,	
  and	
  execuQon	
  engines	
  will	
  
decouple	
  and	
  specialize	
  
•  In	
  fact,	
  you	
  should	
  really	
  want	
  this	
  to	
  happen	
  
• Share	
  systems	
  among	
  languages	
  
• Reduce	
  fragmentaQon	
  and	
  “lock-­‐in”	
  
• Shih	
  developer	
  focus	
  to	
  usability	
  	
  
•  PredicQon:	
  we’ll	
  be	
  there	
  by	
  2025;	
  sooner	
  if	
  we	
  all	
  get	
  our	
  act	
  together	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Crahing	
  quality	
  data	
  tools	
  
•  Quality	
  /	
  usefulness	
  is	
  usually	
  forged	
  by	
  the	
  fire	
  of	
  bamle	
  
•  Real	
  world	
  use	
  cases	
  and	
  social	
  proof	
  trump	
  theory	
  
•  Eat	
  that	
  dog	
  food	
  
•  When	
  in	
  doubt?	
  Look	
  at	
  the	
  test	
  suite.	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  frames,	
  circa	
  2015	
  
•  Tight	
  coupling	
  amongst	
  
• User	
  API	
  
• In-­‐memory	
  data	
  representaQon	
  
• SerializaQon/deserializaQon	
  tools	
  
• AnalyQcs	
  /	
  computaQon	
  
• Code	
  evaluaQon	
  semanQcs	
  
•  Code	
  sharing	
  amongst	
  languages	
  fairly	
  difficult,	
  rarely	
  happens	
  in	
  pracQce	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
In-­‐memory	
  representaQon	
  a	
  major	
  problem	
  
•  Any	
  algorithms	
  wrimen	
  against	
  a	
  data	
  frame	
  implementaQon	
  assume	
  a	
  parQcular	
  
custom	
  
• Memory	
  layout	
  
• Type	
  system	
  (+	
  missing	
  data	
  representaQon)	
  
• Memory	
  management	
  strategy	
  
•  Due	
  to	
  organic	
  growth	
  of	
  libraries	
  /	
  language	
  ecosystems,	
  there	
  was	
  never	
  an	
  
effort	
  to	
  reach	
  any	
  design	
  consensus	
  
•  Downstream	
  symptoms:	
  benchmark-­‐driven	
  development 	
  	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Lies,	
  damn	
  lies,	
  and	
  benchmarks	
  
•  Usual	
  targets	
  of	
  benchmarking:	
  
• IO	
  (CSV	
  and	
  database	
  reading)	
  
• Joins	
  
• AggregaQon	
  /	
  group-­‐by	
  operaQons	
  
•  What’s	
  actually	
  being	
  tested	
  
• Quality	
  of	
  algorithm	
  implementaQon	
  
• Data	
  representaQon	
  
• Memory	
  access	
  pamerns	
  /	
  CPU	
  cache	
  efficiency	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Example:	
  group-­‐by-­‐aggregate	
  
df.groupby(‘a’)
.b.sum()
SELECT a, sum(b) AS total
FROM df
GROUP BY 1
df %>% group_by(‘a’)
%>% summarise(total=sum(b))
What’s	
  actually	
  being	
  benchmarked	
  
•  CPU/IO	
  efficiency	
  for	
  for	
  scanning	
  a	
  and	
  b	
  
columns	
  
•  Speed	
  to	
  push	
  a	
  values	
  through	
  a	
  hash	
  table	
  
(quality	
  of	
  hash	
  table	
  impl	
  now	
  an	
  issue)	
  
•  Time	
  to	
  sum	
  b	
  values	
  given	
  known	
  
categorizaQon	
  (using	
  hash	
  table)	
  
•  Speed	
  of	
  creaQng	
  result	
  data	
  frame	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  types	
  
•  PrimiQve	
  value	
  types	
  
• Number	
  (integer,	
  floaQng	
  point)	
  
• Boolean	
  
• String	
  (UTF-­‐8),	
  Binary	
  
• Timestamp	
  
•  Complex	
  /	
  nested	
  types	
  
• Lists	
  
• Structs	
  
• Maps	
  
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  types	
  
{
‘name’: {
‘first’: ‘Wes’,
‘last’: ‘McKinney’
}
‘age’: 30
‘numbers’: [1, 2, 3, 4]
‘addresses’: [
{‘city’: ‘New York’,
‘state’: ‘NY’},
{‘city’: ‘San Francisco’,
‘state’: ‘CA’},
]
‘keys’: [(‘foo’, 27), (‘baz’, 35)]
}
struct<
name: struct<
first: string,
last: string
>,
age: int32,
numbers: list<int32>,
addresses: list<struct<
city: string,
state: string
>>,
keys: map<string, int>
>
Example	
  Table	
  Row	
   Table	
  schema	
  
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Table	
  memory	
  layout	
  
•  Columnar	
  
• Faster	
  for	
  analyQcs,	
  single-­‐column	
  scans	
  
• ProjecQons	
  (column	
  add/remove)	
  cheap	
  
• Row	
  appends	
  harder	
  
• OperaQons	
  across	
  single	
  rows	
  are	
  slower	
  
•  Row-­‐oriented	
  
• Slower	
  single-­‐column	
  scans	
  
• ProjecQons	
  expensive	
  
• Row	
  appends	
  easier	
  
• OperaQons	
  across	
  single	
  rows	
  are	
  faster	
  
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
SerializaQon	
  and	
  protocols	
  
•  Text-­‐based	
  formats	
  
• CSV/TSV,	
  JSON	
  
•  Binary	
  formats	
  
• Avro	
  
• Thrih	
  
• Protocol	
  buffers	
  
• Parquet	
  (related:	
  ColumnIO	
  and	
  RecordIO	
  at	
  Google)	
  
• ORCFile	
  
• HDF5	
  
• Language	
  specific:	
  NumPy,	
  bcolz,	
  Rdata,	
  etc.	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Lossy	
  /	
  text-­‐based	
  formats	
  
•  Like	
  CSV,	
  JSON,	
  etc.	
  
•  Can	
  be	
  read	
  “easily”	
  by	
  anyone	
  
•  Expensive	
  to	
  read	
  (parse)	
  and	
  write/generate	
  
•  Expensive	
  to	
  infer	
  correct	
  schema	
  if	
  not	
  known	
  
• SQll	
  must	
  validate	
  parsed	
  data,	
  even	
  if	
  you	
  know	
  the	
  schema	
  
•  Not	
  a	
  high	
  fidelity	
  format	
  
• Schema	
  must	
  be	
  known	
  	
  
• Data	
  can	
  be	
  easily	
  lost-­‐in-­‐translaQon	
  
• CSV	
  cannot	
  easily	
  handle	
  nested	
  schemas	
  
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Binary	
  data	
  formats	
  
•  Some,	
  like	
  Parquet	
  and	
  ORCFile,	
  are	
  columnar	
  /	
  analyQcs-­‐oriented	
  
•  Others	
  (Avro,	
  Thrih,	
  Protobuf),	
  designed	
  for	
  more	
  general	
  data	
  transport	
  and	
  
remote	
  procedure	
  calls	
  (RPC)	
  in	
  distributed	
  systems	
  
•  Language-­‐specific	
  formats	
  generally	
  don’t	
  have	
  full-­‐featured	
  reader-­‐writers	
  
outside	
  the	
  primary	
  language	
  
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  representaQon	
  across	
  languages	
  
•  R:	
  list	
  of	
  named	
  R	
  arrays	
  (each	
  containing	
  data	
  of	
  a	
  primiQve	
  R	
  type)	
  
• Data	
  frames	
  have	
  an	
  implicit	
  schema,	
  but	
  user	
  does	
  not	
  interact	
  with	
  
• Limited	
  support	
  for	
  nested	
  type	
  values	
  (JSON-­‐like	
  data)	
  
•  pandas:	
  complex,	
  but	
  fundamentally	
  data	
  stored	
  in	
  NumPy	
  arrays	
  
• No	
  explicit	
  table	
  schema;	
  NumPy	
  complexiQes	
  largely	
  hidden	
  from	
  user	
  
• Limited	
  /	
  no	
  built-­‐in	
  support	
  for	
  nested	
  type	
  values	
  
•  Missing	
  /	
  NULL	
  values	
  handled	
  on	
  a	
  type-­‐by-­‐type	
  basis	
  (someQmes	
  not	
  at	
  all)	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Missing	
  data	
  
•  R:	
  uses	
  special	
  values	
  to	
  encode	
  NA	
  
•  Python	
  pandas:	
  special	
  values,	
  but	
  does	
  not	
  work	
  for	
  all	
  types	
  
•  For	
  arbitrary	
  type	
  data,	
  best	
  approach	
  is	
  to	
  use	
  a	
  bit	
  or	
  byte	
  to	
  represent	
  
nullness/not-­‐nullness	
  
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
I	
  am	
  acQvely	
  working	
  to	
  address	
  
infrastructural	
  issues	
  that	
  have	
  
prevented	
  collaboraQon	
  /	
  code	
  
sharing	
  in	
  tabular	
  data	
  tools	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Some	
  awesome	
  R	
  data	
  frame	
  stuff	
  
•  “Hadley	
  Stack”	
  
• dplyr,	
  Qdyr	
  
• legacy:	
  plyr,	
  reshape2	
  	
  
• ggplot2	
  
•  data.table	
  (data.frame	
  +	
  indices,	
  fast	
  algorithms)	
  
•  xts	
  :	
  Qme	
  series	
  
	
  
23	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
R	
  data	
  frames:	
  rough	
  edges	
  
•  Copy-­‐on-­‐write	
  semanQcs	
  
•  API	
  fragmentaQon	
  /	
  inconsistency	
  
• Use	
  the	
  “Hadley	
  stack”	
  for	
  improved	
  sanity	
  
•  Factor	
  /	
  String	
  dichotomy	
  
• stringsAsFactors=FALSE	
  a	
  blessing	
  and	
  curse	
  
•  Somewhat	
  limited	
  type	
  system	
  
24	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
dplyr	
  
•  Composable	
  table	
  API	
  
•  Good	
  example	
  of	
  what	
  the	
  “decoupled”	
  future	
  might	
  look	
  like	
  
• New	
  in-­‐memory	
  R/RCpp	
  execuQon	
  engine	
  
• SQL	
  backends	
  for	
  large	
  subset	
  of	
  API	
  
25	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  DataFrames	
  
•  R/pandas-­‐inspired	
  API	
  for	
  tabular	
  data	
  manipulaQon	
  in	
  Scala,	
  Python,	
  etc.	
  
•  Logical	
  operaQon	
  graphs	
  rewrimen	
  internally	
  in	
  more	
  efficient	
  form	
  
•  Good	
  interop	
  with	
  Spark	
  SQL	
  
•  Some	
  interoperability	
  with	
  pandas	
  
•  ParQal	
  API	
  Decoupling!	
  
•  Low-­‐level	
  internals	
  (DataFrame	
  ~	
  RDD[Row])	
  are	
  …	
  not	
  the	
  best	
  
26	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
pandas	
  
•  Several	
  key	
  data	
  structures,	
  data	
  frame	
  among	
  them	
  
•  Considerably	
  more	
  complex	
  internals	
  than	
  other	
  data	
  frame	
  libraries	
  
•  Some	
  good	
  things	
  
• Born	
  of	
  need	
  
• A	
  “bameries	
  included”	
  approach	
  
• Hierarchical	
  axis	
  labeling:	
  addresses	
  some	
  hard	
  use	
  cases	
  at	
  expense	
  of	
  
semanQc	
  complexity	
  
• Strong	
  Qme	
  series	
  support	
  
27	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
pandas:	
  rough	
  edges	
  
•  Axis	
  labelling	
  can	
  get	
  in	
  the	
  way	
  for	
  folks	
  needing	
  “just	
  a	
  table”	
  
•  Ceded	
  control	
  of	
  its	
  type	
  system	
  /	
  data	
  rep’n	
  from	
  day	
  1	
  to	
  NumPy	
  
•  Inefficient	
  string	
  handling	
  (uses	
  NumPy	
  object	
  arrays)	
  
•  Missing	
  data	
  handling	
  less	
  precise	
  than	
  other	
  tools	
  
•  No	
  C	
  API;	
  very	
  difficult	
  for	
  external	
  tools	
  to	
  interact	
  with	
  pandas	
  objects	
  
28	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
blaze	
  
•  Open	
  source	
  Python	
  project	
  run	
  by	
  ConQnuum	
  AnalyQcs	
  
•  Bespoke	
  type	
  system	
  (“datashape”)	
  supporQng	
  many	
  kinds	
  of	
  nested	
  schemas	
  
•  Data	
  expression	
  API	
  with	
  many	
  execuQon	
  backends	
  (SQL,	
  pandas,	
  Spark,	
  etc.)	
  
•  Does	
  not	
  have	
  a	
  “naQve”	
  backend	
  (was	
  to	
  be	
  the—now	
  aborted?—libdynd	
  
project)	
  
•  Another	
  good	
  example	
  of	
  the	
  “decoupled”	
  future	
  
29	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
libdynd	
  
•  Next-­‐gen	
  NumPy-­‐in-­‐C++	
  project	
  started	
  by	
  ConQnuum	
  in	
  2011	
  to	
  be	
  a	
  naQve	
  
backend	
  for	
  Blaze,	
  appears	
  not	
  to	
  be	
  acQvely	
  pursued	
  
•  Implements	
  the	
  new	
  datashape	
  protocol	
  
•  Standalone	
  C++11/14	
  library,	
  with	
  deep	
  Python	
  binding	
  
•  Lots	
  of	
  interesQng	
  internals	
  /	
  implementaQon	
  ideas	
  
30	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
bcolz:	
  the	
  new	
  carray	
  /	
  ctable	
  
•  Compressed	
  in-­‐memory	
  /	
  on	
  disk	
  columnar	
  table	
  structure	
  
•  Outgrowth	
  of	
  
• carray	
  (compressed-­‐in-­‐memory	
  NumPy	
  array)	
  
• blosc	
  (mulQthreaded	
  shuffling	
  compression	
  library)	
  
31	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Julia:	
  DataFrames.jl	
  
•  Started	
  by	
  Harlan	
  Harris	
  &	
  co	
  
•  Part	
  of	
  broader	
  JuliaStats	
  iniQaQve	
  
• More	
  R-­‐like	
  than	
  pandas-­‐like	
  
• Very	
  acQve:	
  >	
  50	
  contributors!	
  	
  
•  SQll	
  comparaQvely	
  early	
  
• Less	
  comprehensive	
  API	
  
• More	
  limited	
  IO	
  capabiliQes	
  
32	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Other	
  data	
  frames	
  
•  Saddle	
  (Scala)	
  
• Dev’d	
  by	
  Adam	
  Klein	
  (ex-­‐AQR)	
  at	
  Novus	
  Partners	
  (fintech	
  startup)	
  
• Designed	
  and	
  used	
  for	
  financial	
  use	
  cases	
  
•  Deedle	
  (F#	
  /	
  .NET)	
  
• Dev’d	
  by	
  AK’s	
  colleagues	
  at	
  BlueMountain	
  (hedge	
  fund)	
  
•  GraphLab	
  /	
  Dato	
  
• Really	
  good	
  C++	
  data	
  frame	
  with	
  Python	
  interface	
  
• Dual-­‐licensed:	
  AGPL	
  +	
  Commercial	
  
•  That’s	
  not	
  all!	
  Haskell,	
  Go,	
  etc…	
  
33	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
We’re	
  not	
  done	
  yet	
  
•  The	
  future	
  is	
  JSON-­‐like	
  
• Support	
  for	
  nested	
  types	
  /	
  semi-­‐structured	
  data	
  is	
  sQll	
  weak	
  
•  Wanted:	
  Apache-­‐licensed,	
  community	
  standard	
  C/C++	
  data	
  frame	
  data	
  structure	
  
and	
  analyQcs	
  toolkit	
  that	
  we	
  all	
  use	
  (R,	
  Python,	
  Julia)	
  
•  Bring	
  on	
  the	
  Great	
  Decoupling	
  
34	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Thank	
  you	
  
@wesmckinn	
  

More Related Content

What's hot

Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Extending Pandas using Apache Arrow and Numba
Extending Pandas using Apache Arrow and NumbaExtending Pandas using Apache Arrow and Numba
Extending Pandas using Apache Arrow and NumbaUwe Korn
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbenchRan Wei
 

What's hot (20)

Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Extending Pandas using Apache Arrow and Numba
Extending Pandas using Apache Arrow and NumbaExtending Pandas using Apache Arrow and Numba
Extending Pandas using Apache Arrow and Numba
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 

Viewers also liked

Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageWes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingKristian Alexander
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaWes McKinney
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on SparkAlpine Data
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksData Con LA
 

Viewers also liked (16)

Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
Ayasdi strata
Ayasdi strataAyasdi strata
Ayasdi strata
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on Spark
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 

Similar to DataFrames: The Extended Cut Summary

PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learnJohn D Almon
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Don Demcsak
 
Big data berlin
Big data berlinBig data berlin
Big data berlinkammeyer
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Adam Doyle
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development SecuritySam Bowne
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauSam Palani
 
Impala use case @ edge
Impala use case @ edgeImpala use case @ edge
Impala use case @ edgeRam Kedem
 

Similar to DataFrames: The Extended Cut Summary (20)

PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
NoSQL-Overview
NoSQL-OverviewNoSQL-Overview
NoSQL-Overview
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Impala use case @ edge
Impala use case @ edgeImpala use case @ edge
Impala use case @ edge
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 

More from Wes McKinney (16)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

DataFrames: The Extended Cut Summary

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   DataFrames:  The  Extended  Cut   Wes  McKinney  @wesmckinn   Open  Data  Science  Conference,  2015-­‐05-­‐31  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   This  talk   •  Some  commentary  /  notes  on  all  the  data  frame  interfaces  out  there   •  Community  /  collaboraQon  challenges   •  Opinions  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   Disclaimer:  This  is  a  nuanced   discussion  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   Who  am  I?   •  Originator  of  pandas  (2008  -­‐  )     •  Financial  analyQcs  in  R  /  Python  starQng  2007   •  2010-­‐2012   • Hiatus  from  gainful  employment   • Make  pandas  ready  for  primeQme   • Write  "Python  for  Data  Analysis”   •  2013-­‐2014:  DataPad  with  Chang  She  &  co   •  2014  -­‐  :  Open  source  development  @  Cloudera  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   Data  frame,  in  brief   •  A  table-­‐like  data  structure   •  An  API  /  user  interface  for  the  table   • SelecQng  data   • Math  and  relaQonal  algebra  (join,  filter,  etc.)   • File  /  database  IO   • ad  infinitum  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   Some  axes  of  comparison   •  Data  structure  internals  (types,  in-­‐memory  representaQon,  etc.)   •  Basic  table  API   •  RelaQonal  algebra  support   •  Group-­‐by  /  split-­‐apply-­‐combine  API   •  Performance,  memory  use,  evaluaQon  semanQcs   •  Missing  data   •  Data  Qdying  /  ETL  tools   •  IO  uQliQes   •  Domain  specific  tools  (e.g.  Qme  series)   •  …  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   The  Great  Data  Tool  Decoupling™   •  Thesis:  over  Qme,  user  interfaces,  data  storage,  and  execuQon  engines  will   decouple  and  specialize   •  In  fact,  you  should  really  want  this  to  happen   • Share  systems  among  languages   • Reduce  fragmentaQon  and  “lock-­‐in”   • Shih  developer  focus  to  usability     •  PredicQon:  we’ll  be  there  by  2025;  sooner  if  we  all  get  our  act  together  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Crahing  quality  data  tools   •  Quality  /  usefulness  is  usually  forged  by  the  fire  of  bamle   •  Real  world  use  cases  and  social  proof  trump  theory   •  Eat  that  dog  food   •  When  in  doubt?  Look  at  the  test  suite.  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   Data  frames,  circa  2015   •  Tight  coupling  amongst   • User  API   • In-­‐memory  data  representaQon   • SerializaQon/deserializaQon  tools   • AnalyQcs  /  computaQon   • Code  evaluaQon  semanQcs   •  Code  sharing  amongst  languages  fairly  difficult,  rarely  happens  in  pracQce  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   In-­‐memory  representaQon  a  major  problem   •  Any  algorithms  wrimen  against  a  data  frame  implementaQon  assume  a  parQcular   custom   • Memory  layout   • Type  system  (+  missing  data  representaQon)   • Memory  management  strategy   •  Due  to  organic  growth  of  libraries  /  language  ecosystems,  there  was  never  an   effort  to  reach  any  design  consensus   •  Downstream  symptoms:  benchmark-­‐driven  development    
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Lies,  damn  lies,  and  benchmarks   •  Usual  targets  of  benchmarking:   • IO  (CSV  and  database  reading)   • Joins   • AggregaQon  /  group-­‐by  operaQons   •  What’s  actually  being  tested   • Quality  of  algorithm  implementaQon   • Data  representaQon   • Memory  access  pamerns  /  CPU  cache  efficiency  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   Example:  group-­‐by-­‐aggregate   df.groupby(‘a’) .b.sum() SELECT a, sum(b) AS total FROM df GROUP BY 1 df %>% group_by(‘a’) %>% summarise(total=sum(b)) What’s  actually  being  benchmarked   •  CPU/IO  efficiency  for  for  scanning  a  and  b   columns   •  Speed  to  push  a  values  through  a  hash  table   (quality  of  hash  table  impl  now  an  issue)   •  Time  to  sum  b  values  given  known   categorizaQon  (using  hash  table)   •  Speed  of  creaQng  result  data  frame  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   Data  types   •  PrimiQve  value  types   • Number  (integer,  floaQng  point)   • Boolean   • String  (UTF-­‐8),  Binary   • Timestamp   •  Complex  /  nested  types   • Lists   • Structs   • Maps  
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   Data  types   { ‘name’: { ‘first’: ‘Wes’, ‘last’: ‘McKinney’ } ‘age’: 30 ‘numbers’: [1, 2, 3, 4] ‘addresses’: [ {‘city’: ‘New York’, ‘state’: ‘NY’}, {‘city’: ‘San Francisco’, ‘state’: ‘CA’}, ] ‘keys’: [(‘foo’, 27), (‘baz’, 35)] } struct< name: struct< first: string, last: string >, age: int32, numbers: list<int32>, addresses: list<struct< city: string, state: string >>, keys: map<string, int> > Example  Table  Row   Table  schema  
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   Table  memory  layout   •  Columnar   • Faster  for  analyQcs,  single-­‐column  scans   • ProjecQons  (column  add/remove)  cheap   • Row  appends  harder   • OperaQons  across  single  rows  are  slower   •  Row-­‐oriented   • Slower  single-­‐column  scans   • ProjecQons  expensive   • Row  appends  easier   • OperaQons  across  single  rows  are  faster  
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   SerializaQon  and  protocols   •  Text-­‐based  formats   • CSV/TSV,  JSON   •  Binary  formats   • Avro   • Thrih   • Protocol  buffers   • Parquet  (related:  ColumnIO  and  RecordIO  at  Google)   • ORCFile   • HDF5   • Language  specific:  NumPy,  bcolz,  Rdata,  etc.  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   Lossy  /  text-­‐based  formats   •  Like  CSV,  JSON,  etc.   •  Can  be  read  “easily”  by  anyone   •  Expensive  to  read  (parse)  and  write/generate   •  Expensive  to  infer  correct  schema  if  not  known   • SQll  must  validate  parsed  data,  even  if  you  know  the  schema   •  Not  a  high  fidelity  format   • Schema  must  be  known     • Data  can  be  easily  lost-­‐in-­‐translaQon   • CSV  cannot  easily  handle  nested  schemas  
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   Binary  data  formats   •  Some,  like  Parquet  and  ORCFile,  are  columnar  /  analyQcs-­‐oriented   •  Others  (Avro,  Thrih,  Protobuf),  designed  for  more  general  data  transport  and   remote  procedure  calls  (RPC)  in  distributed  systems   •  Language-­‐specific  formats  generally  don’t  have  full-­‐featured  reader-­‐writers   outside  the  primary  language  
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   Data  representaQon  across  languages   •  R:  list  of  named  R  arrays  (each  containing  data  of  a  primiQve  R  type)   • Data  frames  have  an  implicit  schema,  but  user  does  not  interact  with   • Limited  support  for  nested  type  values  (JSON-­‐like  data)   •  pandas:  complex,  but  fundamentally  data  stored  in  NumPy  arrays   • No  explicit  table  schema;  NumPy  complexiQes  largely  hidden  from  user   • Limited  /  no  built-­‐in  support  for  nested  type  values   •  Missing  /  NULL  values  handled  on  a  type-­‐by-­‐type  basis  (someQmes  not  at  all)  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   Missing  data   •  R:  uses  special  values  to  encode  NA   •  Python  pandas:  special  values,  but  does  not  work  for  all  types   •  For  arbitrary  type  data,  best  approach  is  to  use  a  bit  or  byte  to  represent   nullness/not-­‐nullness  
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.   I  am  acQvely  working  to  address   infrastructural  issues  that  have   prevented  collaboraQon  /  code   sharing  in  tabular  data  tools  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Some  awesome  R  data  frame  stuff   •  “Hadley  Stack”   • dplyr,  Qdyr   • legacy:  plyr,  reshape2     • ggplot2   •  data.table  (data.frame  +  indices,  fast  algorithms)   •  xts  :  Qme  series    
  • 23. 23  ©  Cloudera,  Inc.  All  rights  reserved.   R  data  frames:  rough  edges   •  Copy-­‐on-­‐write  semanQcs   •  API  fragmentaQon  /  inconsistency   • Use  the  “Hadley  stack”  for  improved  sanity   •  Factor  /  String  dichotomy   • stringsAsFactors=FALSE  a  blessing  and  curse   •  Somewhat  limited  type  system  
  • 24. 24  ©  Cloudera,  Inc.  All  rights  reserved.   dplyr   •  Composable  table  API   •  Good  example  of  what  the  “decoupled”  future  might  look  like   • New  in-­‐memory  R/RCpp  execuQon  engine   • SQL  backends  for  large  subset  of  API  
  • 25. 25  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  DataFrames   •  R/pandas-­‐inspired  API  for  tabular  data  manipulaQon  in  Scala,  Python,  etc.   •  Logical  operaQon  graphs  rewrimen  internally  in  more  efficient  form   •  Good  interop  with  Spark  SQL   •  Some  interoperability  with  pandas   •  ParQal  API  Decoupling!   •  Low-­‐level  internals  (DataFrame  ~  RDD[Row])  are  …  not  the  best  
  • 26. 26  ©  Cloudera,  Inc.  All  rights  reserved.   pandas   •  Several  key  data  structures,  data  frame  among  them   •  Considerably  more  complex  internals  than  other  data  frame  libraries   •  Some  good  things   • Born  of  need   • A  “bameries  included”  approach   • Hierarchical  axis  labeling:  addresses  some  hard  use  cases  at  expense  of   semanQc  complexity   • Strong  Qme  series  support  
  • 27. 27  ©  Cloudera,  Inc.  All  rights  reserved.   pandas:  rough  edges   •  Axis  labelling  can  get  in  the  way  for  folks  needing  “just  a  table”   •  Ceded  control  of  its  type  system  /  data  rep’n  from  day  1  to  NumPy   •  Inefficient  string  handling  (uses  NumPy  object  arrays)   •  Missing  data  handling  less  precise  than  other  tools   •  No  C  API;  very  difficult  for  external  tools  to  interact  with  pandas  objects  
  • 28. 28  ©  Cloudera,  Inc.  All  rights  reserved.   blaze   •  Open  source  Python  project  run  by  ConQnuum  AnalyQcs   •  Bespoke  type  system  (“datashape”)  supporQng  many  kinds  of  nested  schemas   •  Data  expression  API  with  many  execuQon  backends  (SQL,  pandas,  Spark,  etc.)   •  Does  not  have  a  “naQve”  backend  (was  to  be  the—now  aborted?—libdynd   project)   •  Another  good  example  of  the  “decoupled”  future  
  • 29. 29  ©  Cloudera,  Inc.  All  rights  reserved.   libdynd   •  Next-­‐gen  NumPy-­‐in-­‐C++  project  started  by  ConQnuum  in  2011  to  be  a  naQve   backend  for  Blaze,  appears  not  to  be  acQvely  pursued   •  Implements  the  new  datashape  protocol   •  Standalone  C++11/14  library,  with  deep  Python  binding   •  Lots  of  interesQng  internals  /  implementaQon  ideas  
  • 30. 30  ©  Cloudera,  Inc.  All  rights  reserved.   bcolz:  the  new  carray  /  ctable   •  Compressed  in-­‐memory  /  on  disk  columnar  table  structure   •  Outgrowth  of   • carray  (compressed-­‐in-­‐memory  NumPy  array)   • blosc  (mulQthreaded  shuffling  compression  library)  
  • 31. 31  ©  Cloudera,  Inc.  All  rights  reserved.   Julia:  DataFrames.jl   •  Started  by  Harlan  Harris  &  co   •  Part  of  broader  JuliaStats  iniQaQve   • More  R-­‐like  than  pandas-­‐like   • Very  acQve:  >  50  contributors!     •  SQll  comparaQvely  early   • Less  comprehensive  API   • More  limited  IO  capabiliQes  
  • 32. 32  ©  Cloudera,  Inc.  All  rights  reserved.   Other  data  frames   •  Saddle  (Scala)   • Dev’d  by  Adam  Klein  (ex-­‐AQR)  at  Novus  Partners  (fintech  startup)   • Designed  and  used  for  financial  use  cases   •  Deedle  (F#  /  .NET)   • Dev’d  by  AK’s  colleagues  at  BlueMountain  (hedge  fund)   •  GraphLab  /  Dato   • Really  good  C++  data  frame  with  Python  interface   • Dual-­‐licensed:  AGPL  +  Commercial   •  That’s  not  all!  Haskell,  Go,  etc…  
  • 33. 33  ©  Cloudera,  Inc.  All  rights  reserved.   We’re  not  done  yet   •  The  future  is  JSON-­‐like   • Support  for  nested  types  /  semi-­‐structured  data  is  sQll  weak   •  Wanted:  Apache-­‐licensed,  community  standard  C/C++  data  frame  data  structure   and  analyQcs  toolkit  that  we  all  use  (R,  Python,  Julia)   •  Bring  on  the  Great  Decoupling  
  • 34. 34  ©  Cloudera,  Inc.  All  rights  reserved.   Thank  you   @wesmckinn