
Using PySpark to Process Boat Loads of Data

Learn how to use PySpark for processing massive amounts of data. Combined with the GitHub repo - - this presentation will help you gain familiarity with processing data using Python and Spark.

If you're thinking about machine learning and not sure if it can help improve your business, but want to find out, set up a free 20-minute consultation with us:



  1. 1. Using PySpark to Process Boat Loads of Data Robert Dempsey, CEO Atlantic Dominion Solutions
  2. 2. We’ve mastered three jobs so you can focus on one - growing your business.
  3. 3. The Three Jobs At Atlantic Dominion Solutions we perform three functions for our customers: Consulting: we assess and advise in the areas of technology, team and process to determine how machine learning can have the biggest impact on your business. Implementation: after a strategy session to determine the work you need we get to work using our proven methodology and begin delivering smarter applications. Training: continuous improvement requires continuous learning. We provide both on-premises and online training.
  4. 4. Writing the Book Co-authoring the book Building Machine Learning Pipelines. Written for software developers and data scientists, Building Machine Learning Pipelines teaches the skills required to create and use the infrastructure needed to run modern intelligent systems.
  5. 5. Robert Dempsey, CEO • Professional: Software Engineer • Author: Books and online courses • Instructor: Lotus Guides, District Data Labs • Owner: Atlantic Dominion Solutions, LLC
  6. 6. What You Can Expect Today
  7. 7. MTAC Framework™ Mindset Toolbox Application Communication
  8. 8. Core Principles 1. When acquiring knowledge, start by going wide instead of deep. 2. Always focus on what's important to people rather than just the technology. 3. Be able to clearly communicate what you know with others.
  9. 9. MTAC Framework™ Applied Mindset: use-case centric example Toolbox: Python, PySpark, Docker Application: Code & Analysis Communication: Q&A
  10. 10. Mindset
  11. 11. Keep It Simple Image: Jesse van Dijk
  12. 12. Solve the Problem Image: Paulo
  13. 13. Explain It, Simply
  14. 14. Break Through
  15. 15. Use Case
  16. 16. Got Clean Air?
  17. 17. Got Clean Air? • Clean air is important. • Toxic pollutants are known or suspected to cause cancer, reproductive effects, birth defects, and adverse environmental effects.
  18. 18. Questions to Answer 1. Which state has the highest level of pollutants? 2. Which county has the highest level of pollutants? 3. What are the top 5 pollutants by unit of measure? 4. What are the trends of pollutants by state over time?
  19. 19. Toolbox
  20. 20. Python
  21. 21. Spark
  22. 22. The Core of Spark • A computational engine that schedules, distributes, and monitors tasks running on a cluster
  23. 23. Higher Level Tools • Spark SQL: SQL and structured data • MLlib: machine learning • GraphX: graph processing • Spark Streaming: process streaming data
  24. 24. Storage • Local file system • Amazon S3 • Cassandra • Hive • HBase • File formats • Text files • Sequence files • Avro • Parquet • Hadoop Input Format
  25. 25. Hadoop? • Not necessary, but… • If you have multiple nodes you need a resource manager like YARN or Mesos • You'll need access to distributed storage like HDFS, Amazon S3 or Cassandra
  26. 26. PySpark
  27. 27. What Is PySpark? • An API that exposes the Spark programming model to Python • Built on top of Spark's Java API • Data is processed with Python and cached/shuffled in the JVM • Driver programs
  28. 28. Driver Programs • Launch parallel operations on a cluster • Contain application functions • Define distributed datasets • Access Spark through a SparkContext • Use Py4J to launch a JVM and create a JavaSparkContext
  29. 29. When to Use It • When you need to… • Process boat loads of data (TB) • Perform operations that require all the data to be in memory (machine learning) • Efficiently process streaming data • Create an overly complicated use case to present at a meetup
  30. 30. Docker
  31. 31. Docker • Software container platform • Containers are application only (no OS) • Deployed anywhere with same CPU architecture (x86-64, ARM) • Available for *nix, Mac, Windows
  32. 32. Container Architecture
  33. 33. Application
  34. 34. PySpark in Data Architectures
  35. 35. Architecture #1 Agent File System Apache Spark File System Agent ES 1 2 3 Data Flow
  36. 36. Architecture #2 Data Flow Agent 1 2 3 Agent Agent Athena S3 S3 Apache Spark
  37. 37. Architecture #3 Data Flow Agent 1 2 3 Agent Agent ES S3 HDFS Apache Kafka Apache Spark HBase
  38. 38. What We’ll Build (Simple) Agent File System Apache Spark File System 1 2 3 Data Flow
  39. 39. Python • Analysis • Visualization • Code in our Spark jobs
  40. 40. Spark • By using PySpark
  41. 41. PySpark • Process all the data! • Perform aggregations
  42. 42. Docker • Run Spark in a Docker container. • So you don’t have to install anything.
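The containerized setup above could be sketched as a docker-compose file along these lines; the image names, tags, and ports are assumptions, not the repo's actual configuration:

```yaml
version: "3"
services:
  spark-master:
    image: bde2020/spark-master:2.2.0-hadoop2.7   # illustrative image/tag
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # master RPC port
  spark-worker:
    image: bde2020/spark-worker:2.2.0-hadoop2.7   # illustrative image/tag
    environment:
      - SPARK_MASTER=spark://spark-master:7077
    depends_on:
      - spark-master
```

Running `docker-compose up` against a file like this brings up a standalone master and worker without installing Spark locally.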
  43. 43. Code Time!
  44. 44. README • • Create a virtual environment (Anaconda) • Install dependencies • Run docker-compose to create the Spark containers • Run a script (or all of them!) per the README
  45. 45. Dive In • Data explorer notebook • Q1 - Most polluted state • Q2 - Most polluted county • Q3 - Top pollutants by unit of measure • Q4 - Pollutants over time
  46. 46. Communication
  47. 47. Q&A
  48. 48. Early Bird Specials!
  49. 49. Intro to Data Science for Software Engineers Goes live October 23, 2017 Normally: $97 Pre-Launch: $47
  50. 50. Where to Find Me Website Lotus Guides LinkedIn Twitter Github robertwdempsey rdempsey rdempsey
  51. 51. Thank You!