
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces to Large-scale Data


You like to use R, and you need to use big data. dplyr, one of the most popular packages for R, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala.
But there can be pitfalls: dplyr works differently with different data sources, and those differences can bite you if you don’t know what you’re doing.
Ian Cook is a data scientist, an R contributor, and a curriculum developer at Cloudera University. In this webinar, Ian will show you exactly what you need to know about the packages sparklyr (from RStudio) and implyr (from Cloudera). He will show you how to write dplyr code that works across these different interfaces. He will also solve some mysteries:
Do I need to know SQL to use dplyr?
When is a “tbl” not a “tibble”?
Why is 1 not always equal to 1?
When should you collect(), collapse(), and compute()?
How can you use dplyr to combine data stored in different systems?
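
Two of the questions above, "When is a tbl not a tibble?" and "When should you collect(), collapse(), and compute()?", can be illustrated with a minimal sketch. It is not taken from the webinar. It assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed, uses an in-memory SQLite database as a stand-in for a remote engine such as Spark or Impala, and uses a made-up `flights` table with hypothetical `carrier` and `delay` columns.

```r
# Assumption: in-memory SQLite stands in for a remote engine (Spark, Impala).
# The table name and column names are hypothetical, for illustration only.
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights_db <- copy_to(
  con,
  data.frame(carrier = c("AA", "UA", "AA"), delay = c(5, 12, 30)),
  name = "flights"
)

# A remote table is a lazy "tbl", not a tibble: no rows are held in R yet.
is_remote <- inherits(flights_db, "tbl_lazy")
is_local  <- tibble::is_tibble(flights_db)

delayed <- flights_db %>% filter(delay > 10)

# collapse(): wraps the query built so far in a subquery; nothing is executed.
collapsed <- collapse(delayed)

# compute(): executes the query and stores the result in a temporary table
# in the database; the object returned to R is still a lazy tbl.
computed <- compute(delayed)

# collect(): executes the query and pulls the rows into R as a local tibble.
local_rows <- collect(delayed)

DBI::dbDisconnect(con)
```

In this sketch only `local_rows` is a tibble held in R's memory; `collapsed` and `computed` remain references to data that lives in the database.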


Published in: Technology

  1. dplyr Interfaces to Large-Scale Data. Ian Cook, @ianmcook, ian@cloudera.com. © Cloudera, Inc. All rights reserved.
  2. Context
     Mission for Cloudera: Provide a platform for data analysts, data scientists to efficiently query, analyze, model large-scale data in clusters, cloud storage
       • By distributing Apache Spark, Apache Impala, other tools
       • By enabling productive use of these tools
     Python and R users often have difficulty moving from smaller data to large-scale distributed data
       • Familiar packages, methods don’t work the same way on distributed data
  3. Poll question
  4. [Diagram; surviving labels: SQL, PySpark, SparkR, "SQL or DataFrame API" (repeated)]
  5. Poll question
  6. [Diagram; surviving labels: SQL, PySpark, SparkR, dplyr]
  7. dplyr
     dplyr provides a set of verbs that perform common data manipulation steps
       • select() to select columns
       • filter() to filter rows
       • arrange() to order rows
       • mutate() to create new columns
       • summarise() to aggregate
       • group_by() to perform operations by group
     dplyr works on local data and with remote data sources
       • For remote sources, dplyr commands are translated into SQL
  8. Poll question
  9. Demonstration. Example code at github.com/ianmcook/dplyr-examples
  10. dplyr SQL backends: dplyr ↕ dbplyr ↕ dplyr SQL backend package* ↕ DBI ↕ DBI-compatible interface package ↕ database driver or connector ↕ database/engine (* optional)
  11. sparklyr
       • Provides a SQL backend to dplyr for Spark
       • Also exposes the MLlib API and a subset of the Spark DataFrames API
       • Developed by RStudio
       • spark.rstudio.com
  12. implyr
       • Provides a SQL backend to dplyr for Impala
       • Uses ODBC or JDBC to connect to Impala
       • Developed at Cloudera
       • tiny.cloudera.com/implyr
  13. Five tips for using dplyr with SQL data sources
  14. Tip 1: Use show_query()
  15. Tip 2: filter() early, arrange() late
  16. Tip 3: Check your data types
  17. Tip 4: Know your SQL engine
  18. Tip 5: Know when to collect()
  19. Questions? Ian Cook, @ianmcook, ian@cloudera.com
  20. Cloudera Data Science Workbench. More information: tiny.cloudera.com/cdsw. OnDemand training: tiny.cloudera.com/cdsw-training
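
Three of the tips in the slides (use show_query(), filter() early and arrange() late, know when to collect()) can be sketched together. This is an illustration under stated assumptions, not the presenter's demonstration code: it assumes dplyr, dbplyr, DBI, and RSQLite are installed, uses an in-memory SQLite database in place of Spark or Impala, and invents a small `flights` table with hypothetical `carrier` and `delay` columns.

```r
# Assumption: in-memory SQLite stands in for a remote engine (Spark, Impala).
# The table name, column names, and values are made up for illustration.
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights_db <- copy_to(
  con,
  data.frame(
    carrier = c("AA", "UA", "AA", "DL"),
    delay   = c(5, 12, 30, 45)
  ),
  name = "flights"
)

mean_delays <- flights_db %>%
  filter(delay > 10) %>%          # filter early: fewer rows reach later steps
  group_by(carrier) %>%
  summarise(mean_delay = mean(delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))       # arrange late: sort only the final result

# Tip 1: print the SQL that dplyr generated, without running the query.
show_query(mean_delays)

# Tip 5: collect only once the result is small enough to hold in R.
result <- collect(mean_delays)

DBI::dbDisconnect(con)
```

Sorting as the last step matters because SQL engines do not guarantee that an ordering applied in an intermediate step survives later operations, and sorting a large unfiltered table is expensive.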