The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
Published in: Technology

Transcript

  • 1. The Last Mile: Challenges and opportunities in data tools • Strata Santa Clara 2014
  • 2. Wes McKinney @wesmckinn • Former quant @ AQR (a hedge fund) • Creator of pandas • Author of Python for Data Analysis (O’Reilly) • Founder and CEO of DataPad • www.datapad.io
  • 3. www.datapad.io
  • 4. http://datapad.io • New web-based visual analytics environment • In private beta, join us! • Hiring for engineering
  • 5. Some Problems • Statistics and ML • ETL • Data Visualization • Workflows + Collaboration • Business Analytics
  • 6. Data toolchains (diagram labels) • Data Acquisition • Data Slinging / Management • ETL • SQL / Tidy Form • Code-based Env • UI-based Env • Analysis
  • 7. Data toolchains (diagram labels) • Data Acquisition • Maybe HDFS • ETL • ETL • Analytic DBMS • Code-based Env • ETL? • UI-based Env
  • 8. Some Trends • SQL-on-Hadoop • Spark / Spark ecosystem • New life in visual ETL / data prep • Better data manipulation libraries • Columnar / analytic databases
  • 9. Crunching data with code • Python: pandas • Data frames in Scala, F#, Julia, … • Spark (Scala/Java) • R (+ data.table, dplyr)
  • 10. Some Programmatic Tool Problems • Awkward / slow DB interactions • In-process memory management • Reuse of intermediate results • Execution speed • Evaluation semantics
  • 11. dplyr (R library) • By Hadley Wickham and Romain Francois • Uniform R API, SQL and in-memory backends • Describe complex data manipulation using “chaining”
  • 12. dplyr (R library)
      final <- crime.by.state %.%
        filter(State == "New York", Year == 2005) %.%
        arrange(desc(Count)) %.%
        select(Type.of.Crime, Count) %.%
        mutate(Proportion = Count / sum(Count)) %.%
        group_by(Type.of.Crime) %.%
        summarise(num.types = n(), counts = sum(Count))
  • 13. Apache Spark • Broad set of primitive data ops • Distributed in-memory model scales naturally, high performance • Build complex computation graphs for analytics • Applications: Shark, GraphX, …
  • 14. pandas (Python library) • Broad traction • Strong feature: time series analytics • User-friendly API and community • Being used in many unexpected ways
  • 15. badger (DataPad internal) • A high performance in-memory analytics engine for DataPad • Addresses many performance and memory management concerns in pandas • May become an OSS project someday
  • 16. Standardized machine learning toolkits • scikit-learn • PMML • Mahout • Cloudera ML
  • 17. Enterprise data workflows • Apache Crunch • Pig • Cascading (+ Scalding, Cascalog)
  • 18. Analytic databases • Powering visual analytics tools on big data • MPP / in-memory execution model • Compressed columnar storage
  • 19. Visual data tools • Visual Analytics/BI gone mainstream • New Data Prep products • Drag-and-drop predictive analytics • Proliferation of vertical SaaS solutions
  • 20. Visual tool challenges • Tend to be less flexible than code • Multiple tools to get the job done • Many still dependent on Excel • Collaboration, versioning, provenance
  • 21. Collaboration tools • Discovery and reuse • Cataloguing insights • Analytics from ad-hoc to production • Interesting projects: IPython Notebook, Shiny, Pivotal Chorus
  • 22. Some ideas
  • 23. Abstract away the execution model (where possible)
  • 24. More integrated environments
  • 25. Enhance collaboration
  • 26. Thank you!
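
For readers more at home in Python, the dplyr chain on slide 12 can be approximated with pandas method chaining. This is a rough sketch and not taken from the talk: it assumes a recent pandas release (for `assign` and named aggregation), and the `crime_by_state` frame and its values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the slide's crime.by.state table; the values are made up.
crime_by_state = pd.DataFrame({
    "State": ["New York", "New York", "New York", "Ohio"],
    "Year": [2005, 2005, 2004, 2005],
    "Type.of.Crime": ["Burglary", "Robbery", "Burglary", "Burglary"],
    "Count": [300, 100, 250, 80],
})

final = (
    crime_by_state
    # filter(State == "New York", Year == 2005)
    .query("State == 'New York' and Year == 2005")
    # arrange(desc(Count))
    .sort_values("Count", ascending=False)
    # select(Type.of.Crime, Count)
    .loc[:, ["Type.of.Crime", "Count"]]
    # mutate(Proportion = Count / sum(Count))
    .assign(Proportion=lambda df: df["Count"] / df["Count"].sum())
    # group_by(Type.of.Crime)
    .groupby("Type.of.Crime", as_index=False)
    # summarise(num.types = n(), counts = sum(Count))
    .agg(num_types=("Count", "size"), counts=("Count", "sum"))
)

print(final)
```

Each step maps onto one dplyr verb, so the chain reads top to bottom in the same order as the R version. As in the original, the `Proportion` column is computed and then dropped by the final aggregation.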
