Data Processing using pyspark
Spark using Python as the programming language
PyCon India 2017 Workshop
About us
• ITVersity is a holistic continuous learning platform offering
  • Video content – www.YouTube.com/itversityin
  • Subscription based labs – https://labs.itversity.com
  • Community based support – http://discuss.itversity.com
  • Udemy – https://www.udemy.com/user/it-versity/
• $20 coupon – pycon2017 (100 coupons will be issued at the time of the workshop)
Agenda
• Hands-on workshop
  • Revise Python
  • Understand Spark APIs
  • Develop programs to process data
Objectives
• Data Model
• Revising Python
• Introduction to Spark
• Initializing the job
• Create RDD using data from HDFS
• Read data from different file formats
• Standard Transformations – using core API
• Standard Transformations – using Spark SQL
• Saving RDD back to HDFS
• Save data in different file formats
Lab Access
• 1 day of lab access will be provided to the workshop audience
• Credentials will be shared at the time of the workshop
• Make sure to install PuTTY or Cygwin for accessing the lab
• 32-bit OS machines will not be supported
Data Model - retail_db
Revising Python
• Quick recap of important concepts in Python
  • Basic programming constructs
  • Lambda functions
  • Basic I/O operations
  • Collections and operations on collections
• Get revenue for a given order_id by adding order_item_subtotal from order_items (see the sketch below)
• Read data from the local file system – /data/retail_db/order_items/part-00000
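A minimal sketch of the recap exercise in plain Python, assuming the standard retail_db layout where order_item_order_id is the 2nd field and order_item_subtotal is the 5th field of each comma-delimited record:

# Revenue for a given order_id from a local copy of order_items
def get_order_revenue(path, order_id):
    with open(path) as order_items_file:
        order_items = order_items_file.read().splitlines()
    # keep only the items of the given order and sum their subtotals
    return sum(float(rec.split(",")[4])
               for rec in order_items
               if int(rec.split(",")[1]) == order_id)

print(get_order_revenue("/data/retail_db/order_items/part-00000", 2))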
Introduction to Spark
• Spark is a distributed computing framework
• Bunch of APIs to process data
• Higher level modules such as Data Frames/SQL, Streaming, MLlib and more
• Well integrated with Python, Scala, Java etc.
• Spark uses the HDFS API to deal with the file system
• It can run against any distributed or cloud file system – HDFS, S3, Azure Blob etc.
• Only Core Spark and Spark SQL (including Data Frames) are part of the curriculum for CCA Spark and Hadoop Developer; CCA also requires some knowledge of Spark Streaming
• Pre-requisites – a programming language (Scala or Python)
Introduction to Spark - Documentation
• Use the 1.6.x documentation for now
• Get familiar with the documentation as much as possible
Initializing the job
• Initialize using pyspark
  • Running in yarn mode (client or cluster mode)
  • Setting up additional properties – spark.ui.port
• Programmatic initialization of the job (see the sketch below)
  • Create a configuration object
  • Create a Spark context object
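A minimal sketch of both styles for Spark 1.6; the master and spark.ui.port values are only examples:

# Launching the shell with additional properties (run from the gateway node):
#   pyspark --master yarn --conf spark.ui.port=12901

# Programmatic initialization from a Python script
from pyspark import SparkConf, SparkContext

conf = SparkConf(). \
    setAppName("Data Processing using pyspark"). \
    setMaster("yarn-client"). \
    set("spark.ui.port", "12901")

sc = SparkContext(conf=conf)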
Create RDD using data from HDFS
• RDD – Resilient Distributed Dataset
  • In-memory
  • Distributed
  • Resilient
• Reading files from HDFS
• Reading files from the local file system and creating an RDD
• Quick overview of Transformations and Actions
• DAG and lazy evaluation
• Previewing the data using Actions (see the sketch below)
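A minimal sketch, assuming the lab exposes /public/retail_db/orders on HDFS and /data/retail_db/orders on the local file system of the gateway node:

# RDD from HDFS and from the local file system
orders = sc.textFile("/public/retail_db/orders")
orders_local = sc.textFile("file:///data/retail_db/orders")

# Nothing is read yet (lazy evaluation); actions trigger the DAG
print(orders.count())          # action – number of records
for rec in orders.take(10):    # action – preview the first 10 records
    print(rec)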
Read data from different file formats
• Data Frame (distributed collection with structure)
• APIs provided by sqlContext
• Supported file formats
  • orc
  • json
  • parquet
  • avro
• Previewing the data
  • show
• Let us read data from JSON files – /public/retail_db_json/orders (see the sketch below)
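A minimal sketch using the sqlContext available in the pyspark shell (Spark 1.6); reading avro typically requires the spark-avro package in addition:

# Create a Data Frame from JSON files and preview it
orders = sqlContext.read.json("/public/retail_db_json/orders")

orders.printSchema()   # inspect the inferred structure
orders.show(10)        # preview the first 10 rows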
Standard Transformations – using core API
• Standard transformations (see the sketch below)
  • String manipulation (Python)
  • Row level transformations
  • Filtering (horizontal and vertical)
  • Joins
  • Aggregation
  • Sorting
  • Ranking
  • Set operations
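A minimal sketch of a few of these using the core API, assuming the comma-delimited retail_db layout for orders and order_items:

orders = sc.textFile("/public/retail_db/orders")
order_items = sc.textFile("/public/retail_db/order_items")

# Row level transformation + horizontal filtering – completed orders as (order_id, status)
completed = orders. \
    filter(lambda o: o.split(",")[3] in ("COMPLETE", "CLOSED")). \
    map(lambda o: (int(o.split(",")[0]), o.split(",")[3]))

# Vertical filtering – keep only (order_id, subtotal) from order_items
item_subtotals = order_items. \
    map(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])))

# Join, aggregation and sorting – revenue per completed order, highest first
revenue_per_order = completed. \
    join(item_subtotals). \
    map(lambda rec: (rec[0], rec[1][1])). \
    reduceByKey(lambda total, subtotal: total + subtotal). \
    sortBy(lambda rec: rec[1], ascending=False)

for rec in revenue_per_order.take(10):
    print(rec)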
Standard Transformations – using Spark SQL
• Writing SQL for standard transformations (see the sketch below)
  • String manipulation
  • Row level transformations
  • Filtering (horizontal and vertical)
  • Joins
  • Aggregation
  • Sorting
  • Ranking
  • Set operations
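A minimal sketch of the same logic expressed in SQL, assuming the JSON copies of the tables under /public/retail_db_json:

orders = sqlContext.read.json("/public/retail_db_json/orders")
order_items = sqlContext.read.json("/public/retail_db_json/order_items")

# Register Data Frames as temporary tables so they can be queried with SQL
orders.registerTempTable("orders")
order_items.registerTempTable("order_items")

# Filtering, join, aggregation and sorting in one query
revenue_per_order = sqlContext.sql("""
    SELECT o.order_id, round(sum(oi.order_item_subtotal), 2) AS order_revenue
    FROM orders o JOIN order_items oi
      ON o.order_id = oi.order_item_order_id
    WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    GROUP BY o.order_id
    ORDER BY order_revenue DESC
""")

revenue_per_order.show(10)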
Saving RDD back to HDFS
• Make sure data is saved with proper delimiters (see the sketch below)
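A minimal sketch, reusing the revenue_per_order RDD from the core API sketch above; the output path is only an example and the directory must not already exist:

# Format each record with an explicit delimiter before writing to HDFS
revenue_per_order. \
    map(lambda rec: str(rec[0]) + "\t" + str(rec[1])). \
    saveAsTextFile("/user/training/revenue_per_order")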
Save data in different file formats
• Data Frames can be saved into different file formats (see the sketch below)
  • json
  • orc
  • parquet
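A minimal sketch using the Data Frame writer API (Spark 1.6), reusing the revenue_per_order Data Frame from the Spark SQL sketch above; the target paths are only examples, and orc typically requires Hive support (HiveContext):

revenue_per_order.write.json("/user/training/revenue_per_order_json")
revenue_per_order.write.parquet("/user/training/revenue_per_order_parquet")
revenue_per_order.write.orc("/user/training/revenue_per_order_orc")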
