
Spark Up Your Workflows: Remix Your Data with Databricks and HDInsight

When your data is too much for one system to handle, you have Big Data. Find out how to use the new Alteryx Spark Direct connector and Spark Code tool with Databricks and Microsoft Azure HDInsight to bring cloud-based Spark analytics to your workflows. We cover setting up the connections, building workflows, and using the Spark Code tool to take advantage of the powerful features offered by Databricks and HDInsight. Also learn how to set up a Databricks cluster for use with Alteryx Designer and the Spark Direct connector, including the required Amazon Web Services instances and Microsoft Azure HDInsight.

David Wilcox - Senior Software Engineer - Alteryx



  1. SPARK UP YOUR WORKFLOWS Presented by David Wilcox #ALTERYX18
  2. FORWARD-LOOKING STATEMENTS This presentation includes “forward-looking statements” within the meaning of the Private Securities Litigation Reform Act of 1995. These forward-looking statements may be identified by the use of terminology such as “believe,” “may,” “will,” “intend,” “expect,” “plan,” “anticipate,” “estimate,” “potential,” or “continue,” or other comparable terminology. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product availability, growth and financial metrics and any statements regarding product roadmaps, strategies, plans or use cases. Although Alteryx believes that the expectations reflected in any of these forward-looking statements are reasonable, these expectations or any of the forward-looking statements could prove to be incorrect, and actual results or outcomes could differ materially from those projected or assumed in the forward-looking statements. Alteryx’s future financial condition and results of operations, as well as any forward-looking statements, are subject to risks and uncertainties, including but not limited to the factors set forth in Alteryx’s press releases, public statements and/or filings with the Securities and Exchange Commission, especially the “Risk Factors” sections of Alteryx’s Quarterly Report on Form 10-Q. These documents and others containing important disclosures are available at www.sec.gov or in the “Investors” section of Alteryx’s website at www.alteryx.com. All forward-looking statements are made as of the date of this presentation and Alteryx assumes no obligation to update any such forward-looking statements. Any unreleased services or features referenced in this or other presentations, press releases or public statements are only intended to outline Alteryx’s general product direction. They are intended for information purposes only, and may not be incorporated into any contract. This is not a commitment to deliver any material, code, or functionality (which may not be released on time or at all) and customers should not rely upon this presentation or any such statements to make purchasing decisions. The development, release, and timing of any features or functionality described for Alteryx’s products remains at the sole discretion of Alteryx.
  3. AGENDA • Introduction • Big Data? • Alteryx versus Apache Spark • Alteryx with Apache Spark • Contrived example • Tips • Q&A
  4. DAVID WILCOX ALTERYX USER SINCE 2015 When I use Alteryx, I feel in control of my success. With Alteryx, I can do big things with small effort.
  5. BIG DATA?
  6. IT’S NOT BIG DATA, IT’S BIG WORK Look beyond the 3…5…6…7…10 Vs • Velocity, volume, variety, variability, veracity, validity, vulnerability, volatility, visualization, and value are all important characteristics of the data • However, when thinking about Big Data problems, think about them as Big Work – What work are you doing with the data? – How fast do you need to do that work? – What resources do you need to do that work?
  7. ALTERYX VS APACHE SPARK
  8. BIG DATA STRENGTHS Alteryx • Obviously very friendly to code-free solutions, even with complex problems – Plus, we’re code-friendly, too! • Can handle very large data sets • Our community Apache Spark • Highly scalable • Handles data sets as large as you can store • Huge ecosystem of libraries and frameworks
  9. BIG DATA CHALLENGES Alteryx • Complex solutions that can’t be handled with tools require coding – We’re code-friendly, but you still need someone to write the code Apache Spark • All solutions require writing code • Clusters have to be deployed, managed and maintained
  10. ALTERYX WITH APACHE SPARK
  11. BEST OF BOTH WORLDS Code-free, code-friendly, and scalable • Create Alteryx workflows that integrate Apache Spark using Apache Spark Direct in-database connections – Even complex problems can be code-free • Add custom logic in Python, Scala, or R using the Apache Spark Code tool • Scale out the size of your Apache Spark cluster as your work gets bigger
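For readers following along, here is a minimal PySpark sketch of the kind of custom logic the slide is describing. The DataFrame name `df`, its `temp_c` column, and the category labels are assumptions for illustration only, not the Spark Code tool's actual interface.

```python
# Sketch only: df, temp_c, and the band labels are illustrative assumptions.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def comfort_band(temp_c):
    # Plain Python logic, shipped to the cluster and applied row by row as a UDF.
    if temp_c is None:
        return "unknown"
    if 20 <= temp_c <= 27:
        return "comfortable"
    return "hot" if temp_c > 27 else "cold"

comfort_udf = F.udf(comfort_band, StringType())

# Add the derived column; the rest of the workflow can keep treating df as a normal stream.
df = df.withColumn("comfort_band", comfort_udf(F.col("temp_c")))
```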
  12. BUT WAIT Alteryx workflows are easy. Aren’t Apache Spark clusters hard? “Clusters have to be deployed, managed and maintained” – Me, three slides back
  13. APACHE SPARK IN THE CLOUD • Deploy and scale clusters as needed • Simplified management • Minimal maintenance • Easy integration with Alteryx – Databricks on AWS in 2018.2 – Databricks on Azure coming soon – Microsoft Azure HDInsight coming soon
  14. DATABRICKS Apache Spark in the cloud from the creators of Apache Spark • Founded by the creators of the Spark research project at UC Berkeley • Web-based management interface for creating and deploying Apache Spark clusters • Notebooks for prototyping code • Apache Spark Direct on Databricks on AWS added in Alteryx 2018.2 • Apache Spark Direct on Databricks on Azure coming soon
  15. DATABRICKS CONFIGURATION
  16. MICROSOFT AZURE HDINSIGHT Apache Spark on the Microsoft Azure platform • Managed and supported solution on the Microsoft Azure platform • Can use Azure Data Lake for storage • Notebooks for prototyping code • Apache Spark Direct on Microsoft Azure HDInsight coming soon
  17. MICROSOFT AZURE HDINSIGHT CONFIGURATION
  18. CONTRIVED EXAMPLE
  19. THE PERFECT WEATHER • 20–27 degrees Celsius • 30–60% relative humidity • No more than a light breeze • Stable most of the year
  20. THE SOURCE DATA A whole lot of text files • NOAA hourly integrated surface data • 28,175 weather stations around the world, with data from 1948 to 2018 • 2,106,308,021 records • 606,186 text files totaling 122 GB
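As a rough sketch of how Spark would ingest a pile of text files like this: each ISD file is plain text with one hourly observation per line, so reading the whole directory tree yields one row per observation. The path and the fixed-width positions below are placeholders, not the real ISD record layout.

```python
# Sketch only: the DBFS path and substring positions are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read every file under the directory tree; each line becomes one row in column "value".
raw = spark.read.text("dbfs:/noaa/isd/*/*.txt")

# Carve illustrative fixed-width fields out of each line.
observations = raw.select(
    F.substring("value", 5, 6).alias("station_id"),              # placeholder positions,
    F.substring("value", 16, 8).alias("obs_date"),                # not the real ISD layout
    F.substring("value", 88, 5).cast("int").alias("air_temp_raw"),
)
```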
  21. CONSUME MASS QUANTITIES Processing all of that data with Alteryx • Two workflows – One to load the weather station metadata – One to process the actual hourly weather data files and join that with the weather station metadata • 7 hours 53 minutes to complete on a desktop computer with an Intel i7-7700K CPU and 32 GB RAM • Output was a single 47.4 GB yxdb file
  22. FINDING THE PERFECT SPOT Best weather location in the United States • Use only weather stations in the United States and data from 2016 to present • Which location has the most close-to-perfect weather hours overall and is the least variable?
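Here is a minimal sketch of the kind of Spark query behind that question, using the "perfect weather" criteria from the earlier slide. The DataFrame `parsed_df`, its column names, the 3.3 m/s "light breeze" cutoff, and the use of temperature standard deviation as a variability proxy are all assumptions for illustration.

```python
# Sketch only: parsed_df and its columns are illustrative assumptions.
from pyspark.sql import functions as F

# US stations, 2016 to present.
us_recent = parsed_df.filter((F.col("country") == "US") & (F.col("year") >= 2016))

# Flag hours that meet the perfect-weather criteria.
perfect = us_recent.withColumn(
    "is_perfect",
    (
        F.col("temp_c").between(20, 27)
        & F.col("humidity_pct").between(30, 60)
        & (F.col("wind_speed_ms") <= 3.3)   # "light breeze" cutoff is an assumption
    ).cast("int"),
)

# Most perfect hours first, least temperature variability as the tiebreaker.
ranked = (
    perfect.groupBy("station_name")
    .agg(
        F.sum("is_perfect").alias("perfect_hours"),
        F.stddev("temp_c").alias("temp_stddev"),
    )
    .orderBy(F.col("perfect_hours").desc(), F.col("temp_stddev").asc())
)
ranked.show(5)
```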
  23. SPEED TEST Alteryx • 72 seconds Apache Spark • 48 seconds
  24. THE PERFECT SPOT(S) • Jordan, Montana • Brackett Field, California • Clayton Lake, Maine • El Monte, California • Palo Alto, California
  25. TIPS
  26. PUT YOUR DATA WHERE YOUR COMPUTE IS • Minimize transfers of data between Alteryx and Apache Spark
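One way to apply this tip, sketched with placeholder DBFS paths: land the raw files in cluster storage once, persist the heavy intermediate result there as Parquet, and only bring small summaries back to Designer.

```python
# Sketch only: the dbfs:/ paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Land the raw files next to the cluster once.
hourly = spark.read.text("dbfs:/noaa/isd/*/*.txt")

# Persist the heavy intermediate result as Parquet in cluster storage.
hourly.write.mode("overwrite").parquet("dbfs:/noaa/hourly_raw/")

# Later workflows read the compact columnar copy instead of streaming
# 122 GB of rows back and forth between Designer and the cluster.
hourly = spark.read.parquet("dbfs:/noaa/hourly_raw/")
```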
  27. USE THE RIGHT TOOL • Apache Spark is a sledgehammer. Don’t crack nuts with it • A few lines of Apache Spark code may be better than a complicated Alteryx workflow
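As an example of the "few lines of code" point, here is a sketch that keeps only the most recent observation per station with a window function, a task that can take a chain of sort and filter tools in a workflow. The DataFrame `obs_df` and its columns are assumptions for illustration.

```python
# Sketch only: obs_df, station_id, and obs_time are illustrative assumptions.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank each station's observations newest-first, then keep rank 1.
latest = Window.partitionBy("station_id").orderBy(F.col("obs_time").desc())

latest_obs = (
    obs_df.withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```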
  28. GET FRIENDLY WITH THE CODE-FRIENDLY • Yes, code-free is awesome • Code-friendly opens up more awesome – Add deep learning logic to your Alteryx workflow using the Apache Spark Code tool, for example
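Deep learning libraries vary by cluster, so as a stand-in this sketch shows the general pattern with Spark MLlib's logistic regression instead: train and score a model inside the Spark Code tool. The DataFrame `labeled_df`, its feature columns, and the `label` column are assumptions for illustration.

```python
# Sketch only: labeled_df and its columns are assumptions; MLlib logistic regression
# stands in here for whatever modeling library your cluster actually provides.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Pack the raw columns into the single vector column MLlib expects.
assembler = VectorAssembler(
    inputCols=["temp_c", "humidity_pct", "wind_speed_ms"],
    outputCol="features",
)
train = assembler.transform(labeled_df)

# Fit and score on the cluster; transform() adds prediction and probability columns.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
scored = model.transform(train)
```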
  29. BE PATIENT • Apache Spark changes often and for the better
  30. QUESTIONS?
  31. THANK YOU David Wilcox dwilcox@alteryx.com @davidtwilcox linkedin.com/in/davidtwilcox Please complete a feedback survey! Relive the excitement of Inspire here.
  32. RESOURCES Things to help you in your Alteryx + Apache Spark journey • Apache Spark beginner resources: https://sparkhub.databricks.com/resources/ • Databricks CLI: https://docs.databricks.com/user-guide/dev-tools/databricks-cli.html • Harness the Power of Your Data Lake with Alteryx Spark Direct: https://community.alteryx.com/t5/Data-Science-Blog/Harness-the-Power-of-Your-Data-Lake-with-Alteryx-Spark-Direct/ba-p/83012
