Data Processing using pyspark
Spark using Python as the programming language
PyCon India 2017 Workshop
About us
• ITVersity is a holistic continuous learning platform offering
  • Video content – www.YouTube.com/itversityin
  • Subscription based labs – https://labs.itversity.com
  • Community based support – http://discuss.itversity.com
  • Udemy – https://www.udemy.com/user/it-versity/
• $20 coupon – pycon2017 (100 coupons will be issued at the time of the workshop)
Agenda
• Hands-on workshop
  • Revise Python
  • Understand Spark APIs
  • Develop programs to process data
Objectives
• Data Model
• Revising Python
• Introduction to Spark
• Initializing the job
• Create RDD using data from HDFS
• Read data from different file formats
• Standard Transformations – using core API
• Standard Transformations – using Spark SQL
• Saving RDD back to HDFS
• Save data in different file formats
Lab Access
• 1 day of lab access will be provided to the workshop audience
• Credentials will be shared at the time of the workshop
• Make sure to install PuTTY or Cygwin for accessing the lab
• 32-bit OS machines will not be supported
Data Model - retail_db
Revising Python
• Quick recap of important concepts in Python
  • Basic programming constructs
  • Lambda functions
  • Basic I/O operations
  • Collections and operations on collections
• Get revenue for a given order_id by adding order_item_subtotal from order_items (see the sketch below)
• Read data from the local file system – /data/retail_db/order_items/part-00000
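A minimal sketch of the recap exercise in plain Python, assuming the standard retail_db layout where order_item_order_id is the 2nd field and order_item_subtotal is the 5th field of each comma-delimited record:

# Revenue for a given order_id from a local copy of order_items
def get_order_revenue(path, order_id):
    with open(path) as order_items_file:
        order_items = order_items_file.read().splitlines()
    # keep only the items of the given order and sum their subtotals
    return sum(float(rec.split(",")[4])
               for rec in order_items
               if int(rec.split(",")[1]) == order_id)

print(get_order_revenue("/data/retail_db/order_items/part-00000", 2))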
Introduction to Spark
• Spark is a distributed computing framework
• Bunch of APIs to process data
• Higher level modules such as Data Frames/SQL, Streaming, MLlib and more
• Well integrated with Python, Scala, Java etc.
• Spark uses the HDFS API to deal with the file system
• It can run against any distributed or cloud file system – HDFS, S3, Azure Blob etc.
• Only Core Spark and Spark SQL (including Data Frames) are part of the curriculum for CCA Spark and Hadoop Developer; CCA also requires some knowledge of Spark Streaming
• Pre-requisites – a programming language (Scala or Python)
Introduction to Spark - Documentation
• Use the 1.6.x documentation for now
• Get familiar with the documentation as much as possible
Initializing the job
• Initialize using pyspark
  • Running in yarn mode (client or cluster mode)
  • Setting up additional properties – spark.ui.port
• Programmatic initialization of the job (see the sketch below)
  • Create a configuration object
  • Create a Spark context object
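A minimal sketch of both styles for Spark 1.6; the master and spark.ui.port values are only examples:

# Launching the shell with additional properties (run from the gateway node):
#   pyspark --master yarn --conf spark.ui.port=12901

# Programmatic initialization from a Python script
from pyspark import SparkConf, SparkContext

conf = SparkConf(). \
    setAppName("Data Processing using pyspark"). \
    setMaster("yarn-client"). \
    set("spark.ui.port", "12901")

sc = SparkContext(conf=conf)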
Create RDD using data from HDFS
• RDD – Resilient Distributed Dataset
  • In-memory
  • Distributed
  • Resilient
• Reading files from HDFS
• Reading files from the local file system and creating an RDD
• Quick overview of Transformations and Actions
• DAG and lazy evaluation
• Previewing the data using Actions (see the sketch below)
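A minimal sketch, assuming the lab exposes /public/retail_db/orders on HDFS and /data/retail_db/orders on the local file system of the gateway node:

# RDD from HDFS and from the local file system
orders = sc.textFile("/public/retail_db/orders")
orders_local = sc.textFile("file:///data/retail_db/orders")

# Nothing is read yet (lazy evaluation); actions trigger the DAG
print(orders.count())          # action – number of records
for rec in orders.take(10):    # action – preview the first 10 records
    print(rec)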
Read data from different file formats
• Data Frame (distributed collection with structure)
• APIs provided by sqlContext
• Supported file formats
  • orc
  • json
  • parquet
  • avro
• Previewing the data
  • show
• Let us read data from JSON files – /public/retail_db_json/orders (see the sketch below)
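A minimal sketch using the sqlContext available in the pyspark shell (Spark 1.6); reading avro typically requires the spark-avro package in addition:

# Create a Data Frame from JSON files and preview it
orders = sqlContext.read.json("/public/retail_db_json/orders")

orders.printSchema()   # inspect the inferred structure
orders.show(10)        # preview the first 10 rows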
Standard Transformations – using core API
• Standard transformations (see the sketch below)
  • String manipulation (Python)
  • Row level transformations
  • Filtering (horizontal and vertical)
  • Joins
  • Aggregation
  • Sorting
  • Ranking
  • Set operations
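A minimal sketch of a few of these using the core API, assuming the comma-delimited retail_db layout for orders and order_items:

orders = sc.textFile("/public/retail_db/orders")
order_items = sc.textFile("/public/retail_db/order_items")

# Row level transformation + horizontal filtering – completed orders as (order_id, status)
completed = orders. \
    filter(lambda o: o.split(",")[3] in ("COMPLETE", "CLOSED")). \
    map(lambda o: (int(o.split(",")[0]), o.split(",")[3]))

# Vertical filtering – keep only (order_id, subtotal) from order_items
item_subtotals = order_items. \
    map(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])))

# Join, aggregation and sorting – revenue per completed order, highest first
revenue_per_order = completed. \
    join(item_subtotals). \
    map(lambda rec: (rec[0], rec[1][1])). \
    reduceByKey(lambda total, subtotal: total + subtotal). \
    sortBy(lambda rec: rec[1], ascending=False)

for rec in revenue_per_order.take(10):
    print(rec)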
Standard Transformations – using Spark SQL
• Writing SQL for standard transformations (see the sketch below)
  • String manipulation
  • Row level transformations
  • Filtering (horizontal and vertical)
  • Joins
  • Aggregation
  • Sorting
  • Ranking
  • Set operations
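A minimal sketch of the same logic expressed in SQL, assuming the JSON copies of the tables under /public/retail_db_json:

orders = sqlContext.read.json("/public/retail_db_json/orders")
order_items = sqlContext.read.json("/public/retail_db_json/order_items")

# Register Data Frames as temporary tables so they can be queried with SQL
orders.registerTempTable("orders")
order_items.registerTempTable("order_items")

# Filtering, join, aggregation and sorting in one query
revenue_per_order = sqlContext.sql("""
    SELECT o.order_id, round(sum(oi.order_item_subtotal), 2) AS order_revenue
    FROM orders o JOIN order_items oi
      ON o.order_id = oi.order_item_order_id
    WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    GROUP BY o.order_id
    ORDER BY order_revenue DESC
""")

revenue_per_order.show(10)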
Saving RDD back to HDFS
• Make sure data is saved with proper delimiters (see the sketch below)
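A minimal sketch, reusing the revenue_per_order RDD from the core API sketch above; the output path is only an example and the directory must not already exist:

# Format each record with an explicit delimiter before writing to HDFS
revenue_per_order. \
    map(lambda rec: str(rec[0]) + "\t" + str(rec[1])). \
    saveAsTextFile("/user/training/revenue_per_order")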
Save data in different file formats
• Data Frames can be saved into different file formats (see the sketch below)
  • json
  • orc
  • parquet
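A minimal sketch using the Data Frame writer API (Spark 1.6), reusing the revenue_per_order Data Frame from the Spark SQL sketch above; the target paths are only examples, and orc typically requires Hive support (HiveContext):

revenue_per_order.write.json("/user/training/revenue_per_order_json")
revenue_per_order.write.parquet("/user/training/revenue_per_order_parquet")
revenue_per_order.write.orc("/user/training/revenue_per_order_orc")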
