An Incomplete Data Tools Landscape for Hackers in 2015
Mar. 7, 2015
Overview talk for a mostly Python/R crowd in Minneapolis
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
1 © Cloudera, Inc. All rights reserved.
A [Incomplete] Data Tools Landscape [for Hackers] in 2015
Wes McKinney @wesmckinn
Data^3 Meeting, Minneapolis, MN
2 This talk
• A partial look at different languages and tools
• Limiting scope to either:
  • Permissively licensed open source software, e.g. Apache-licensed (OSS)
  • Non-dual-licensed copyleft OSS (e.g. GPL)
  • i.e. “do you [the community] have any incentive to create patches?”
• Some trends (that I see, anyway)
• Challenges and opportunities
3 Who am I?
• Python data firestarter
• Financial analytics in R / Python starting 2007
• pandas project born of frustration in 2008
• 2010-2012
  • Hiatus from gainful employment
  • Make pandas ready for primetime
  • Write "Python for Data Analysis"
4 Who am I? (cont’d)
• 2013-2014: Co-founder/CEO of DataPad (analytics startup, with early pandas collaborator Chang She)
• Late 2014: DataPad team joins Cloudera
• Now: backend systems and all-things-Python @ Cloudera
5 SQL: Still a lingua franca
• “SQL: the Fortran of Analytics”
• Often a concise, declarative way to express data transforms, analytics, etc.
• Relatively easy to parse, analyze
• SQL recently has seen resurgence with focus on interactive-speed SQL engines, especially on top of HDFS/Hadoop
• Relevant and impactful features (e.g. JSON support) still arriving in established RDBMS like PostgreSQL
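To make the "concise, declarative" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `trades` table, its columns, and the sample rows are invented for illustration and are not from the talk; any SQL engine would express the same aggregation in much the same way.

```python
import sqlite3

# In-memory database; the "trades" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAPL", 10, 125.0), ("AAPL", 5, 127.0), ("MSFT", 20, 42.0)],
)

# The query states *what* result is wanted (totals per symbol);
# the engine decides how to scan, group, and sort.
rows = conn.execute(
    """
    SELECT symbol, SUM(qty) AS total_qty, SUM(qty * price) AS notional
    FROM trades
    GROUP BY symbol
    ORDER BY symbol
    """
).fetchall()
conn.close()

for symbol, total_qty, notional in rows:
    print(symbol, total_qty, notional)
```

The whole transform fits in one statement with no explicit loops over the data, which is the property the slide credits for SQL's staying power.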
6 Historical Python Context
• Scientific / HPC computing focus in 1990s, 2000s
• Python web community developed in parallel, matured faster!
• NumPy became community standard in 2005, born from Numeric + Numarray
• Pyrex, later Cython, easier C / C++ wrapping
• f2py: easy Fortran wrapping
• Anaconda distribution
  • Finally solving Python deployment for all
7 Essential Python stack
• NumPy: low-level array processing
• SciPy: essential computational algos
• pandas: data wrangling
• scikit-learn: machine learning
• matplotlib (+ add-ons, like seaborn): visualization
• numba: numeric hotspot LLVM compiler
• Domain-specific toolkits: nltk, scikit-image, statsmodels, Theano, PyCUDA/PyOpenCL and many others
8 pandas
• A Pythonic take on the classic R “data frame” data structure
• Critical piece to make the Python stack useful in everyday work
• Added axis metadata / labeling for representing multidimensional data
• Focus on easy data wrangling, IO, plotting, and basic analytics
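A short sketch of what "axis metadata / labeling" and "easy data wrangling" look like in practice, assuming pandas is installed. The series, column names, and values are made up for illustration.

```python
import pandas as pd

# Two labeled series whose labels only partly overlap (hypothetical data).
prices = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
weights = pd.Series([0.5, 0.25], index=["c", "a"])

# Arithmetic aligns on index labels, not positions; a label missing
# from either side ("b" here) yields NaN in the result.
weighted = prices * weights

# A small data frame and a typical wrangling step: group, then aggregate.
frame = pd.DataFrame(
    {"city": ["Minneapolis", "St. Paul", "Minneapolis"], "sales": [3, 5, 4]}
)
totals = frame.groupby("city")["sales"].sum()

print(weighted)
print(totals)
```

The label-based alignment is the part that plain NumPy arrays do not provide: the two series are combined by name even though they differ in length and order.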
9 Jeff Reback’s “pandas as PyData middleware” diagram
10 Newer / Up-and-coming Python projects
• Bokeh: interactive / reactive visualization for the web
• Blaze: uniform data expression API
• Odo: easy data migration
11 R Project
• Trusted base of statistics libraries
• Latest and greatest stats research often hits R first
• RStudio
• The “Hadley stack”
  • Visualization: ggplot2 (static) and ggvis (interactive)
  • Data Wrangling: dplyr
  • legacy: plyr / reshape2
12 dplyr
• Started late 2012 by Hadley Wickham, supported by RStudio
• Composable / chainable analytics and data wrangling expressions
• In-memory and SQL backends
• Has attracted folks back to R from Python in a lot of cases
13 Some other great R stuff
• shiny: interactive web apps in R
• Rcpp
• data.table
• xts
14 IPython
• IPython started out as a better interactive Python
• Grew to include web-based computational notebook, GUI console, and other components
• (Google even integrated into Google Drive!)
• IPython Notebook architecture enabled “kernel” processes to be written in nearly any language (even bash!)
• How to build community beyond Python?
15 Enter Jupyter
• http://jupyter.org
• Breaking out notebook machinery into a standalone non-Python-specific project
• Enable project components to evolve at own pace, without large monolithic releases
• JupyterHub: upcoming multi-user notebook server
16 A few words about Hadoop + Big Data
17 Apache Spark
• Originated from Berkeley AMPLab
• General purpose distributed memory-centric data processing framework
• Official APIs: Scala, Java, Python
Source: databricks.com
18 Spark 1.3: DataFrames!
• R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
• Logical operation graphs rewritten internally in more efficient form
• Good interop with Spark SQL
• Some interoperability with pandas
• Will help close the semantic gap between Spark and R/Python
19 Some problems in need of solving
• A Shiny-like quick-and-dirty data app development framework for Python
• IPython/Jupyter notebook collaboration
• A community-standard, Apache-licensed C/C++ data frame library with best-in-class performance
• Ubiquitous support for emerging analytical on-disk storage standards like Parquet
20 Other interesting stuff to look at
• Torch7 / LuaJIT: high performance ML / deep learning on GPUs
  • Facebook AI group open sourced several ML modules
• Apache Flink
  • Up-and-coming Scala-based data processing framework
  • Some overlap with Spark use cases
21 Some other interesting industry trends
• Microsoft
  • Acquired Revolution Analytics, leading commercial R vendor
  • Launched Azure ML: R, Python, and more on Azure cloud
• Dato (fka GraphLab)
  • faster, more scalable machine learning, with Python interface (Paid commercial product, free for non-commercial/academic use)
  • Largest-ever VC investment in a data tools company betting big on Python
• Databricks
  • Offering cloud Spark-notebook-as-a-service
22 Thank you
@wesmckinn