SlideShare a Scribd company logo
1 of 78
Python for Data Science
Introduction to Python
WeCloudData
@WeCloudData @WeCloudData tordatascience
weclouddata
WeCloudData tordatascience
Data Science Job Market
Data Jobs in Canada
Skills – Data Scientist
Data Jobs in Canada
Skills – Data Analyst
Data Jobs in Canada
Skills – Data Engineer
Data Jobs in Canada
Skills – ML Engineer
Data Science Learning Path
Resources
Python
Coding Practice
Coding & Interviews
• LeetCode
• HackerRank
Book Statistics Online Courses
Udemy
• Complete Python Bootcamp
Datacamp
• Introduction to Python
Data Science
Importance of foundations
Data Science
Machine
Learning
Big Data
Data
Engineering
Deep
Learning
ML
Engineering
Focus on one programming language at a time
• Get good at it
Must have skills
• Python
• SQL
Learn Data Science
Follow a Learning Path
Prerequisites
Data Science
Learning Path
• ML algorithms
• 2 Projects
• Interview Practice
Applied ML
• Data wrangling
• Data Visualization
• Predictive Modeling
Data Science
w/ Python
• Big data tools
• ML at scale
• ML deployment
• Job referrals
Big Data
Python
Foundation
SQL for
Data Science
Python
Scientific Python
The ecosystem
Fraud Detection Project Reference Architecture
data lake
Learn Data Science
Understand Data pipelines
data ingestion preprocessing machine learning
real-time scoring
logging and monitoring
reporting & exploration
metadata
Why Python?
The ecosystem
Targeting
Profiles
Personaliz
ation
Parse/Filter Classify Synthesize
POI Prediction
Ontology/TaxonomyContexts
URL Parsing
POI Database
Context Extraction
Topic Modeling
Content Classification
Location Classify
Signal Aggregation
Taste Merging
Taste Scoring
Data Data Science Pipelines Data Product
POI Context Builder
Rule-based Predictor
ML Predictor
Location Attributes
Home/Work Predictor
Co-location Location
Graph
• sklearn
• gensim
• nltk
• mrjob
• PySpark
• PySpark
Why Python?
Python in a data science project
Python Data Management
Structured Data with Pandas DataFrame
Row Index Population Area
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
Column
DataFrame
Values
Column
Row
Row
Row
Row
Row
Row Index
Row Index
Row Index
Row Index
Row Index
Column Index Column Index
# row access returns Series
states.loc['Florida']
# column access returns Series
states['area']
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
population 170312
area 19552860
# index based selection
states.iloc[1:3, :1]
Row
Index
Population
Florida 170312
Illinois 149995
Series
Series
DataFrame
Python Data Management
Pandas - GroupBy()
City
Ticket
Sales
Toronto 100
Montreal 50
Toronto 20
Halifax 40
Montreal 30
Halifax 60
City
Ticket
Sales
Toronto 100
Toronto 20
City
Ticket
Sales
Montreal 50
Montreal 30
City
Ticket
Sales
Halifax 40
Halifax 60
City
Ticket
Sales
Toronto 60
City
Ticket
Sales
Montreal 40
City
Ticket
Sales
Halifax 50
City
Ticket
Sales
Toronto 60
Montreal 40
Halifax 50
Input
DataFrame
Split
DataFrameGroupBy
Apply (sum)
DataFrameGroupBy
Combine
DataFrame
df = pd.DataFrame({'city' : ['Toronto', 'Montreal', 'Toronto', 'Halifax',
'Montreal', 'Halifax'],
'sales' : [100, 50, 20, 40, 30, 60]})
Python Data Management
Pandas - Join/Merge # 1-to-1 join
pd.merge(employee, hr, how='inner', on='employee')
Other features
• Pivot Tables
• Window Functions
DataVisualizations
Matplotlib, Seaborn, Plotly, Bokeh
Python
Plotly Dash
Python
Plotly Dash
Python Data Munging
Database Integration
from mrjob.job import MRJob
class MRWordCount(MRJob):
def mapper(self, _, line):
for word in line.split():
yield(word, 1)
def reducer(self, word, counts):
yield(word, sum(counts))
if __name__ == '__main__':
MRWordCount.run()
from pyspark import SparkContext
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: wordcount <file>", file=sys.stderr)
exit(-1)
sc = SparkContext(appName="PythonWordCount")
lines = sc.textFile(sys.argv[1], 1)
counts = lines.flatMap(lambda x: x.split(' ')) 
.map(lambda x: (x, 1)) 
.reduceByKey(add)
output = sorted(counts.collect(), key=lambda x: x[1], reverse=True)[:50]
for (word, count) in output:
print("%s: %i" % (word, count))
sc.stop()
API Support for Big Data Platforms
Hadoop/Spark
Machine Learning
Sklearn | Gensim | NLTK
ML at Scale
PySpark ML
Data Engineering
Data Pipelines with Airflow
Deep Learning
Strong Python Support
Python
IoT/Robotics
Python
Application
Demo
Data Types and Representations
Data Representation
Images – Numpy Arrays
import matplotlib.pyplot as plt
%matplotlib inline f =
open("../data/mnist_train_100.csv", 'r’)
mnist = f.readlines() f.close()
# Use matplotlib to visualize the digits
f = plt.figure(figsize=(10,10))
count=0
for line in mnist[:9]:
count += 1
linebits = line.split(',’)
imarray = np.asfarray(linebits[1:]).reshape((28,28))
plt.subplot(3,3,count)
plt.subplots_adjust(hspace=0.5)
plt.title("Label is " + linebits[0])
plt.imshow(imarray, cmap='Greys',
interpolation='None')
Data Representation
Images – Numpy Arrays
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:120% !important; }</style>"))
np.get_printoptions()['linewidth’]
np.set_printoptions(linewidth=200)
digit = np.asfarray(mnist[0].split(',')[1:]).reshape((28,28))
print(digit)
Data Representation
Flat Files
Toronto 2019-11-01 00:00:00 100
Montreal 2019-11-01 00:00:00 50
Toronto 2019-11-02 00:00:00 20
Halifax 2019-11-01 00:00:00 40
Montreal 2019-11-02 00:00:00 30
Halifax 2019-11-02 00:00:00 60
[Toronto , 2019-11-01 00:00:00 , 100]
[Montreal , 2019-11-01 00:00:00 , 50]
[Toronto , 2019-11-02 00:00:00 , 20]
[Halifax , 2019-11-01 00:00:00 , 40]
[Montreal , 2019-11-02 00:00:00 , 30]
[Halifax , 2019-11-02 00:00:00 , 60]
[[Toronto , 2019-11-01 00:00:00 , 100],
[Montreal , 2019-11-01 00:00:00 , 50],
[Toronto , 2019-11-02 00:00:00 , 20],
[Halifax , 2019-11-01 00:00:00 , 40],
[Montreal , 2019-11-02 00:00:00 , 30],
[Halifax , 2019-11-02 00:00:00 , 60]]
rows of string rows of list List of lists
Data Representation
Flat Files with Schema
Toronto 2019-11-01 00:00:00 100
Montreal 2019-11-01 00:00:00 50
Toronto 2019-11-02 00:00:00 20
Halifax 2019-11-01 00:00:00 40
Montreal 2019-11-02 00:00:00 30
Halifax 2019-11-02 00:00:00 60
{City:Toronto , Datetime: 2019-11-01 00:00:00 , Sales: 100}
{City:Montreal , Datetime: 2019-11-01 00:00:00 , Sales: 50}
{City:Toronto , Datetime: 2019-11-02 00:00:00 , Sales: 20}
{City:Halifax , Datetime: 2019-11-01 00:00:00 , Sales: 40}
{City:Montreal , Datetime: 2019-11-02 00:00:00 , Sales: 30}
{City:Halifax , Datetime: 2019-11-02 00:00:00 , Sales: 60}
[{City: Toronto ,
Datetime: 2019-11-01 00:00:00 ,
Sales: 100} ,
{City: Montreal ,
Datetime: 2019-11-01 00:00:00 ,
Sales: 50} ,
{City: Toronto ,
Datetime: 2019-11-02 00:00:00 ,
Sales: 20} ,
{City: Halifax ,
Datetime: 2019-11-01 00:00:00 ,
Sales: 40} ,
{City: Montreal ,
Datetime: 2019-11-02 00:00:00 ,
Sales: 100} ,
{City: Halifax ,
Datetime: 2019-11-02 00:00:00 ,
Sales: 100} ,
]
rows of string
rows of dictionary
List of dictionaries
Data Representation
Tabular data
City Date
Ticket
Sales
Toronto 2019-11-01 00:00:00 100
Montreal 2019-11-01 00:00:00 50
Toronto 2019-11-02 00:00:00 20
Halifax 2019-11-01 00:00:00 40
Montreal 2019-11-02 00:00:00 30
Halifax 2019-11-02 00:00:00 60
Toronto 2019-11-01 00:00:00 100
Montreal 2019-11-01 00:00:00 50
Toronto 2019-11-02 00:00:00 20
Halifax 2019-11-01 00:00:00 40
Montreal 2019-11-02 00:00:00 30
Halifax 2019-11-02 00:00:00 60
rows of string structured table
Pandas
Series
0 0.019111
1 -0.406886
2 -0.684147
3 0.713178
4 -0.573002
array([ 0.01911064, -0.40688553, -0.68414699, 0.71317766, -0.5730021)
Index
Values
array([ 0, 1, 2, 3, 4)
import pandas as pd
from numpy.random import randn
# create a pandas series with random numbers
s1 = pd.Series(randn(5))
print(s1)
Pandas
DataFrame
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
# create the population series
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
# create the area series
area_dict = {'California': 423967,
'Texas': 695662,
'New York': 141297,
'Florida': 170312,
'Illinois': 149995}
area = pd.Series(area_dict)
# create the dataframe from a dict of series
states = pd.DataFrame({'population': population,
'area': area})
statesIndex Values
Index Values
# create the dataframe from a dict of series
states = pd.DataFrame({'population': ,
'area':
})
Series
Series
Row
Index
Population Area
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
Columns
Indices
Row
Index
Values
Pandas
DataFrame – Row/Column
Row Index Population Area
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
Column
DataFrame
Values
Column
Row
Row
Row
Row
Row
Row Index
Row Index
Row Index
Row Index
Row Index
Column Index Column Index
Manipulating Tabular Data
Tabular Data Manipulation
Selection/Indexing
City Datetime
Ticket
Sales
Toronto 2019-11-01 00:00:00 100
Montreal 2019-11-01 00:00:00 50
Toronto 2019-11-02 00:00:00 20
Halifax 2019-11-01 00:00:00 40
Montreal 2019-11-02 00:00:00 30
Halifax 2019-11-02 00:00:00 60
City
Ticket
Sales
Toronto 100
Montreal 50
Toronto 20
Halifax 40
Montreal 30
Halifax 60
Tabular Data Manipulation
Filtering
City Datetime
Ticket
Sales
Toronto 2019-11-01 00:00:00 100
Montreal 2019-11-01 00:00:00 50
Toronto 2019-11-02 00:00:00 20
Halifax 2019-11-01 00:00:00 40
Montreal 2019-11-02 00:00:00 30
Halifax 2019-11-02 00:00:00 60
City Date
Ticket
Sales
Halifax 2019-11-01 00:00:00 40
Halifax 2019-11-02 00:00:00 60
Tabular Data Manipulation
Transformation
Cityc Datetime
Ticket
Sales
Toronto 2019-11-01 00:00:00 100
Montreal 2019-11-01 00:00:00 50
Toronto 2019-11-02 00:00:00 20
Halifax 2019-11-01 00:00:00 40
Montreal 2019-11-02 00:00:00 30
Halifax 2019-11-02 00:00:00 60
City Date
Ticket
Sales
Toronto 2019-11-01 100
Montreal 2019-11-01 50
Toronto 2019-11-02 20
Halifax 2019-11-01 40
Montreal 2019-11-02 30
Halifax 2019-11-02 60
Weekday
Monday
Monday
Tuesday
Monday
Tuesday
Tuesday
Tabular Data Manipulation
Data Aggregation
City
Ticket
Sales
Toronto 100
Montreal 50
Toronto 20
Halifax 40
Montreal 30
Halifax 60
City
Ticket
Sales
Toronto 100
Toronto 20
City
Ticket
Sales
Montreal 50
Montreal 30
City
Ticket
Sales
Halifax 40
Halifax 60
City
Ticket
Sales
Toronto 60
City
Ticket
Sales
Montreal 40
City
Ticket
Sales
Halifax 50
City
Ticket
Sales
Toronto 60
Montreal 40
Halifax 50
Input
DataFrame
Split
DataFrameGroupBy
Apply (sum)
DataFrameGroupBy
Combine
DataFrame
Tabular Data Manipulation
GroupBy - aggregate(), agg()
City
Ticket
Sales
Toronto 100
Montreal 50
Toronto 20
Halifax 40
Montreal 30
Halifax 60
City
Ticket
Sales
Toronto 100
Toronto 20
City
Ticket
Sales
Montreal 50
Montreal 30
City
Ticket
Sales
Halifax 40
Halifax 60
City
Ticket
Sales
Toronto 60
City
Ticket
Sales
Montreal 40
City
Ticket
Sales
Halifax 50
City
Ticket
Sales
Toronto 60
Montreal 40
Halifax 50
df.groupby('city').aggregate({'sales': 'mean'})
A filtering operation allows you to drop data
based on the group properties
Tabular Data Manipulation
GroupBy – filter()
City
Ticket
Sales
Toronto 100
Montreal 50
Toronto 20
Halifax 40
Montreal 30
Halifax 60
City
Ticket
Sales
Toronto 100
Toronto 20
City
Ticket
Sales
Montreal 50
Montreal 30
City
Ticket
Sales
Halifax 40
Halifax 60
City
Ticket
Sales
Avg()
Toronto 100 60
Toronto 20 60
City
Ticket
Sales
Avg()
Montreal 50 40
Montreal 30 40
City
Ticket
Sales
Avg()
Halifax 40 50
Halifax 60 50
City
Ticket
Sales
Toronto 100
Toronto 20
Halifax 40
Halifax 60
df.groupby('city').filter(lambda x: x['sales'].mean()>45)
Tabular Data Manipulation
GroupBy – transform()
While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data
to recombine. For such a transformation, the output is the same shape as the input
City
Ticket
Sales
Toronto 100
Montreal 50
Toronto 20
Halifax 40
Montreal 30
Halifax 60
City
Ticket
Sales
Toronto 100
Toronto 20
City
Ticket
Sales
Montreal 50
Montreal 30
City
Ticket
Sales
Halifax 40
Halifax 60
City
Ticket
Sales
Sum()
Toronto 100 120
Toronto 20 120
City
Ticket
Sales
Sum()
Montreal 50 80
Montreal 30 80
City
Ticket
Sales
Sum()
Halifax 40 100
Halifax 60 100
City
Ticket
Sales
Toronto 0.833
Toronto 0.167
Montreal 0.625
Montreal 0.375
Halifax 0.4
Halifax 0.6
df.groupby('city').transform(lambda x: x / x.sum())
• INNER JOIN: Select records that have matching values in both tables.
• LEFT OUTER JOIN: Select records from the first (left-most) table with matching
right table records.
• RIGHT OUTER JOIN: Select records from the second (right-most) table with
matching left table records.
• FULL OUTER JOIN: Selects all records that match either left or right table records.
Types of Data Transformation
Join/Merge
Left Right
Types of Data Transformation
Inner Join
Left Right
Types of Data Transformation
Left Join
Left Right
Types of Data Transformation
Full Outer Join
Python Environment
Learning Python
The 3 stages of transitions
1. Start the Anaconda
Navigator
2. Click the Spyder launch
button
3. For this course we will be
using the Jupyter
Notebook
Python IDE
Anaconda
Jupyter Notebook - Local Server/Laptop
Browser
Notebook Server
Python
R
Julia
Kernels
Jupyter Notebook
Local Server/Laptop
Browser
NotebookServer
Python
R
Julia
Kernels
Jupyter Notebook
Remote Server
Jupyter Notebook
Start Jupyter from the Command Line
notebook
running on
port 8892
notebook running on port
8892
Jupyter Notebook
Ports
Other Python IDE
PyCharm
Learn Python and Data Science with WeCloudData
Data Science with Python
Student Project Demo
Data Science with Python
Student Project Demo
Analyzing the Top Travel Influencers on Instagram
Data Science with Python
Student Project Demo
Open a webpage
and get HTML of
search results page
Fishing Online
Process data using
pandas and save data
into .csv file
Locating page
elements by XPath
Extract Target Data
NextURL?
Yes
No
Data Science with Python
Student Project Demo
Data Science with Python
Student Project Demo
Learn Data Science
Understand the big picture
Data Science Fundamentals
Python Programming
WeCloudData
Python Programming
Why Python?
• Python is the most popular data
science and AI programming
language
• Many employers prefer candidates
with python skills
• Mastering python will expose you
to not only Data Scientist jobs, but
also Data Engineers and DevOps
Python Programming
Who is this course for?
• The Python Programming course is designed for students and professionals who want to
learn to code in Python before moving on to the data science and machine learning topics
• Whether you’re considering a new career in data science or upskilling yourself to catch the
latest data science technology trends, this course is designed for you
• If you come from an Excel background, this course teaches you programming fundamentals
and help you gain confidence in coding through implementing projects and use cases
• If you come from SAS or R programming background, this course will help you comprehend
the difference and advantage of Python to other analytics tools
• This course is designed for students who don’t have prior experience with Python
Python Programming
Learning outcome
After completing this course, you will be able to
• Master python programming essentials for data science and analytics
• Use Python to analyze social media data and visualize data using Pandas and Matplotlib
• Understand various business use cases and most importantly how to apply best
practices to real company data problems using python
• Implement your first machine learning algorithm using python via a hands-on project
• Gain real-world experience through hands-on labs and convince your employers that
you’re up for new data analytics projects at work
• Most importantly, you will be able to build your confidence in python coding and get
prepared for learning machine learning and data science
Python Programming
Instructor – Jodie Zhu
• Machine Learning Enthusiast at Dessa
• University of Toronto, Master of Science (Biostatistics)
• Python Instructor at WeCloudData
• Career development mentor
• Expertise: Python | Data Science | Deep Learning
Machine Learning Engineer
Dessa
Python Programming
Prerequisites
Prerequisites
• There are no requirements other than a working laptop and internet connection.
• You do not need prior experience with Python or another programming language such
as R, SAS, or Matlab, but it helps!
Software Requirements
• Anaconda Distribution
• (Installation guidance will be shared after students register for the course. Our assistant
instructor will make sure you have the right software installed for the course
Python Programming
Syllabus
• Python use cases
• Branching
• Loops
• Data Types: list, tuple,
set
• Functions
• Lab: social media
analytics with python
Day 1
Python Basics
Day 2
Intermediate Python
Day 3
Python Data Analysis
• Data Types: String
• Data Types: Dictionary
• Comprehensions
• Regular expression
• Modules & Packages
• Class and Object
• Interview – Prepare for
Python Interview Tests
• Lab – Class and object
• Pandas introduction
• Intro to visualization
with python
• Accessing database
with python
• Use case: Python for
ML and AI
• Project: Building your
first ML algorithm with
python
• Python Installation
• Jupyter Introduction
• Python Introduction
• DS Introduction
• Twitter Dev API Setup
Pre-course
Installation & Preview
Python learning resources, pdf, coding
interview questions
Python Programming
Industry Use Cases
In this course, we not only teach students how to program Python, but also real-world
business use cases. Understanding the use cases and best practices will allow the students
to confidently apply Python skills to practical data challenges
Business Use Cases
• Why is Python programming important for data science, AI, and Cloud?
• Python use cases in Data Science
• Python use cases in AI
• Python use cases in the financial industry
• Python use cases in sales analytics
• Python use cases in social media analytics
Python Fundamentals
Learning Support
Support you will receive during this course include
• Mentorship and advice from industry experts
• In-classroom learning assistance by our assistant instructor
• Online learning support via Slack from instructor and TA
• Hands-on labs and projects to help you apply what you learn
• Additional resources to help you gain advanced knowledge
• Help from our learning advisor on how to choose the learning path and
specialization courses after the Python Programming course
Python Fundamentals
Testimonials
Python Fundamentals
How to convince your employer
Do you know that most employers will reimburse the training costs?
• We have a detailed course syllabus and email template that you can use to convince
your manager that this is the right course for you and a good investment for your
company
• You will have a completed project and presentation that you can use to demo to
your manager and showcase your newly minted Python skills and get ready for more
interesting data analytics projects
Python Fundamentals
FAQ
Questions Answers
For this course, do I need to know programming? No. Students don’t need to know Python or other programming
languages before taking this course
Do I need to complete hands-on projects? Yes. You will need to complete the project and pass a quiz
before you’re qualified for continuing the data science courses
How does this Python course compare to other
comparable courses?
It’s instructor-led, hands-on, and project-based. You learn to
get started with Python programming with a fraction of the
cost from our competitors
What is the teaching philosophy? Use-case driven, hands-on and fun!
How many hours a week do I expect to put in? Most labs are completed in the class. Expect to put in
additional 3-4 hours a week to work on projects and
assignments
Do I get career support if I take a part-time course? No, career support is not included in any courses. We suggest
students work with a 3rd party career service provider such as
WeCareer.
Introduction to Python by WeCloudData

More Related Content

Similar to Introduction to Python by WeCloudData

Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018DataLab Community
 
Integrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionIntegrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionJuan Sequeda
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerDatabricks
 
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...NoSQLmatters
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Tabular Data Stream: The Binding Between Client and SAP ASE
Tabular Data Stream: The Binding Between Client and SAP ASETabular Data Stream: The Binding Between Client and SAP ASE
Tabular Data Stream: The Binding Between Client and SAP ASESAP Technology
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData StackPeadar Coyle
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
When to NoSQL and when to know SQL
When to NoSQL and when to know SQLWhen to NoSQL and when to know SQL
When to NoSQL and when to know SQLSimon Elliston Ball
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...it-people
 
My SQL Idiosyncrasies That Bite OTN
My SQL Idiosyncrasies That Bite OTNMy SQL Idiosyncrasies That Bite OTN
My SQL Idiosyncrasies That Bite OTNRonald Bradford
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Whose Stack Is It Anyway?
Whose Stack Is It Anyway?Whose Stack Is It Anyway?
Whose Stack Is It Anyway?Ian Thomas
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 

Similar to Introduction to Python by WeCloudData (20)

Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
 
Integrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A ReflectionIntegrating Relational Databases with the Semantic Web: A Reflection
Integrating Relational Databases with the Semantic Web: A Reflection
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
 
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
Tabular Data Stream: The Binding Between Client and SAP ASE
Tabular Data Stream: The Binding Between Client and SAP ASETabular Data Stream: The Binding Between Client and SAP ASE
Tabular Data Stream: The Binding Between Client and SAP ASE
 
MySQL under the siege
MySQL under the siegeMySQL under the siege
MySQL under the siege
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
PyData Paris 2015 - Track 1.2 Gilles Louppe
PyData Paris 2015 - Track 1.2 Gilles LouppePyData Paris 2015 - Track 1.2 Gilles Louppe
PyData Paris 2015 - Track 1.2 Gilles Louppe
 
When to NoSQL and when to know SQL
When to NoSQL and when to know SQLWhen to NoSQL and when to know SQL
When to NoSQL and when to know SQL
 
How to Use JSON in MySQL Wrong
How to Use JSON in MySQL WrongHow to Use JSON in MySQL Wrong
How to Use JSON in MySQL Wrong
 
Php forum2015 tomas_final
Php forum2015 tomas_finalPhp forum2015 tomas_final
Php forum2015 tomas_final
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
 
My SQL Idiosyncrasies That Bite OTN
My SQL Idiosyncrasies That Bite OTNMy SQL Idiosyncrasies That Bite OTN
My SQL Idiosyncrasies That Bite OTN
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Whose Stack Is It Anyway?
Whose Stack Is It Anyway?Whose Stack Is It Anyway?
Whose Stack Is It Anyway?
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 

More from WeCloudData

Data Engineer Intro - WeCloudData
Data Engineer Intro - WeCloudDataData Engineer Intro - WeCloudData
Data Engineer Intro - WeCloudDataWeCloudData
 
AWS Well Architected-Info Session WeCloudData
AWS Well Architected-Info Session WeCloudDataAWS Well Architected-Info Session WeCloudData
AWS Well Architected-Info Session WeCloudDataWeCloudData
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataWeCloudData
 
Machine learning in Healthcare - WeCloudData
Machine learning in Healthcare - WeCloudDataMachine learning in Healthcare - WeCloudData
Machine learning in Healthcare - WeCloudDataWeCloudData
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataWeCloudData
 
Big Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataBig Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataWeCloudData
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataWeCloudData
 
Data Science with Python - WeCloudData
Data Science with Python - WeCloudDataData Science with Python - WeCloudData
Data Science with Python - WeCloudDataWeCloudData
 
SQL for Data Science
SQL for Data ScienceSQL for Data Science
SQL for Data ScienceWeCloudData
 
Data Science Career Insights by WeCloudData
Data Science Career Insights by WeCloudDataData Science Career Insights by WeCloudData
Data Science Career Insights by WeCloudDataWeCloudData
 
Web scraping project aritza-compressed
Web scraping project   aritza-compressedWeb scraping project   aritza-compressed
Web scraping project aritza-compressedWeCloudData
 
Applied Machine Learning Course - Jodie Zhu (WeCloudData)
Applied Machine Learning Course - Jodie Zhu (WeCloudData)Applied Machine Learning Course - Jodie Zhu (WeCloudData)
Applied Machine Learning Course - Jodie Zhu (WeCloudData)WeCloudData
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataWeCloudData
 
Big Data for Data Scientists - Info Session
Big Data for Data Scientists - Info SessionBig Data for Data Scientists - Info Session
Big Data for Data Scientists - Info SessionWeCloudData
 
WeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew ReyesWeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew ReyesWeCloudData
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901WeCloudData
 

More from WeCloudData (16)

Data Engineer Intro - WeCloudData
Data Engineer Intro - WeCloudDataData Engineer Intro - WeCloudData
Data Engineer Intro - WeCloudData
 
AWS Well Architected-Info Session WeCloudData
AWS Well Architected-Info Session WeCloudDataAWS Well Architected-Info Session WeCloudData
AWS Well Architected-Info Session WeCloudData
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudData
 
Machine learning in Healthcare - WeCloudData
Machine learning in Healthcare - WeCloudDataMachine learning in Healthcare - WeCloudData
Machine learning in Healthcare - WeCloudData
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudData
 
Big Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataBig Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudData
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
 
Data Science with Python - WeCloudData
Data Science with Python - WeCloudDataData Science with Python - WeCloudData
Data Science with Python - WeCloudData
 
SQL for Data Science
SQL for Data ScienceSQL for Data Science
SQL for Data Science
 
Data Science Career Insights by WeCloudData
Data Science Career Insights by WeCloudDataData Science Career Insights by WeCloudData
Data Science Career Insights by WeCloudData
 
Web scraping project aritza-compressed
Web scraping project   aritza-compressedWeb scraping project   aritza-compressed
Web scraping project aritza-compressed
 
Applied Machine Learning Course - Jodie Zhu (WeCloudData)
Applied Machine Learning Course - Jodie Zhu (WeCloudData)Applied Machine Learning Course - Jodie Zhu (WeCloudData)
Applied Machine Learning Course - Jodie Zhu (WeCloudData)
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
 
Big Data for Data Scientists - Info Session
Big Data for Data Scientists - Info SessionBig Data for Data Scientists - Info Session
Big Data for Data Scientists - Info Session
 
WeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew ReyesWeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew Reyes
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901
 

Recently uploaded

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 

Recently uploaded (20)

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

Introduction to Python by WeCloudData

  • 1. Python for Data Science Introduction to Python WeCloudData @WeCloudData @WeCloudData tordatascience weclouddata WeCloudData tordatascience
  • 3. Data Jobs in Canada Skills – Data Scientist
  • 4. Data Jobs in Canada Skills – Data Analyst
  • 5. Data Jobs in Canada Skills – Data Engineer
  • 6. Data Jobs in Canada Skills – ML Engineer
  • 8. Resources Python Coding Practice Coding & Interviews • LeetCode • HackerRank Book Statistics Online Courses Udemy • Complete Python Bootcamp Datacamp • Introduction to Python
  • 9. Data Science Importance of foundations Data Science Machine Learning Big Data Data Engineering Deep Learning ML Engineering Focus on one programming language at a time • Get good at it Must have skills • Python • SQL
  • 10. Learn Data Science Follow a Learning Path Prerequisites Data Science Learning Path • ML algorithms • 2 Projects • Interview Practice Applied ML • Data wrangling • Data Visualization • Predictive Modeling Data Science w/ Python • Big data tools • ML at scale • ML deployment • Job referrals Big Data Python Foundation SQL for Data Science
  • 13. Fraud Detection Project Reference Architecture data lake Learn Data Science Understand Data pipelines data ingestion preprocessing machine learning real-time scoring logging and monitoring reporting & exploration metadata
  • 15. Targeting Profiles Personaliz ation Parse/Filter Classify Synthesize POI Prediction Ontology/TaxonomyContexts URL Parsing POI Database Context Extraction Topic Modeling Content Classification Location Classify Signal Aggregation Taste Merging Taste Scoring Data Data Science Pipelines Data Product POI Context Builder Rule-based Predictor ML Predictor Location Attributes Home/Work Predictor Co-location Location Graph • sklearn • gensim • nltk • mrjob • PySpark • PySpark Why Python? Python in a data science project
  • 16. Python Data Management Structured Data with Pandas DataFrame Row Index Population Area California 423967 38332521 Florida 170312 19552860 Illinois 149995 12882135 New York 141297 19651127 Texas 695662 26448193 Column DataFrame Values Column Row Row Row Row Row Row Index Row Index Row Index Row Index Row Index Column Index Column Index # row access returns Series states.loc['Florida'] # column access returns Series states['area'] California 38332521 Florida 19552860 Illinois 12882135 New York 19651127 Texas 26448193 population 170312 area 19552860 # index based selection states.iloc[1:3, :1] Row Index Population Florida 170312 Illinois 149995 Series Series DataFrame
  • 17. Python Data Management Pandas - GroupBy() City Ticket Sales Toronto 100 Montreal 50 Toronto 20 Halifax 40 Montreal 30 Halifax 60 City Ticket Sales Toronto 100 Toronto 20 City Ticket Sales Montreal 50 Montreal 30 City Ticket Sales Halifax 40 Halifax 60 City Ticket Sales Toronto 60 City Ticket Sales Montreal 40 City Ticket Sales Halifax 50 City Ticket Sales Toronto 60 Montreal 40 Halifax 50 Input DataFrame Split DataFrameGroupBy Apply (sum) DataFrameGroupBy Combine DataFrame df = pd.DataFrame({'city' : ['Toronto', 'Montreal', 'Toronto', 'Halifax', 'Montreal', 'Halifax'], 'sales' : [100, 50, 20, 40, 30, 60]})
  • 18. Python Data Management Pandas - Join/Merge # 1-to-1 join pd.merge(employee, hr, how='inner', on='employee') Other features • Pivot Tables • Window Functions
  • 23. from mrjob.job import MRJob class MRWordCount(MRJob): def mapper(self, _, line): for word in line.split(): yield(word, 1) def reducer(self, word, counts): yield(word, sum(counts)) if __name__ == '__main__': MRWordCount.run() from pyspark import SparkContext if __name__ == "__main__": if len(sys.argv) != 2: print("Usage: wordcount <file>", file=sys.stderr) exit(-1) sc = SparkContext(appName="PythonWordCount") lines = sc.textFile(sys.argv[1], 1) counts = lines.flatMap(lambda x: x.split(' ')) .map(lambda x: (x, 1)) .reduceByKey(add) output = sorted(counts.collect(), key=lambda x: x[1], reverse=True)[:50] for (word, count) in output: print("%s: %i" % (word, count)) sc.stop() API Support for Big Data Platforms Hadoop/Spark
  • 24. Machine Learning Sklearn | Gensim | NLTK
  • 30. Data Types and Representations
  • 31. Data Representation Images – Numpy Arrays import matplotlib.pyplot as plt %matplotlib inline f = open("../data/mnist_train_100.csv", 'r’) mnist = f.readlines() f.close() # Use matplotlib to visualize the digits f = plt.figure(figsize=(10,10)) count=0 for line in mnist[:9]: count += 1 linebits = line.split(',’) imarray = np.asfarray(linebits[1:]).reshape((28,28)) plt.subplot(3,3,count) plt.subplots_adjust(hspace=0.5) plt.title("Label is " + linebits[0]) plt.imshow(imarray, cmap='Greys', interpolation='None')
  • 32. Data Representation Images – Numpy Arrays from IPython.core.display import display, HTML display(HTML("<style>.container { width:120% !important; }</style>")) np.get_printoptions()['linewidth’] np.set_printoptions(linewidth=200) digit = np.asfarray(mnist[0].split(',')[1:]).reshape((28,28)) print(digit)
  • 33. Data Representation Flat Files Toronto 2019-11-01 00:00:00 100 Montreal 2019-11-01 00:00:00 50 Toronto 2019-11-02 00:00:00 20 Halifax 2019-11-01 00:00:00 40 Montreal 2019-11-02 00:00:00 30 Halifax 2019-11-02 00:00:00 60 [Toronto , 2019-11-01 00:00:00 , 100] [Montreal , 2019-11-01 00:00:00 , 50] [Toronto , 2019-11-02 00:00:00 , 20] [Halifax , 2019-11-01 00:00:00 , 40] [Montreal , 2019-11-02 00:00:00 , 30] [Halifax , 2019-11-02 00:00:00 , 60] [[Toronto , 2019-11-01 00:00:00 , 100], [Montreal , 2019-11-01 00:00:00 , 50], [Toronto , 2019-11-02 00:00:00 , 20], [Halifax , 2019-11-01 00:00:00 , 40], [Montreal , 2019-11-02 00:00:00 , 30], [Halifax , 2019-11-02 00:00:00 , 60]] rows of string rows of list List of lists
  • 34. Data Representation Flat Files with Schema Toronto 2019-11-01 00:00:00 100 Montreal 2019-11-01 00:00:00 50 Toronto 2019-11-02 00:00:00 20 Halifax 2019-11-01 00:00:00 40 Montreal 2019-11-02 00:00:00 30 Halifax 2019-11-02 00:00:00 60 {City:Toronto , Datetime: 2019-11-01 00:00:00 , Sales: 100} {City:Montreal , Datetime: 2019-11-01 00:00:00 , Sales: 50} {City:Toronto , Datetime: 2019-11-02 00:00:00 , Sales: 20} {City:Halifax , Datetime: 2019-11-01 00:00:00 , Sales: 40} {City:Montreal , Datetime: 2019-11-02 00:00:00 , Sales: 30} {City:Halifax , Datetime: 2019-11-02 00:00:00 , Sales: 60} [{City: Toronto , Datetime: 2019-11-01 00:00:00 , Sales: 100} , {City: Montreal , Datetime: 2019-11-01 00:00:00 , Sales: 50} , {City: Toronto , Datetime: 2019-11-02 00:00:00 , Sales: 20} , {City: Halifax , Datetime: 2019-11-01 00:00:00 , Sales: 40} , {City: Montreal , Datetime: 2019-11-02 00:00:00 , Sales: 100} , {City: Halifax , Datetime: 2019-11-02 00:00:00 , Sales: 100} , ] rows of string rows of dictionary List of dictionaries
  • 35. Data Representation Tabular data City Date Ticket Sales Toronto 2019-11-01 00:00:00 100 Montreal 2019-11-01 00:00:00 50 Toronto 2019-11-02 00:00:00 20 Halifax 2019-11-01 00:00:00 40 Montreal 2019-11-02 00:00:00 30 Halifax 2019-11-02 00:00:00 60 Toronto 2019-11-01 00:00:00 100 Montreal 2019-11-01 00:00:00 50 Toronto 2019-11-02 00:00:00 20 Halifax 2019-11-01 00:00:00 40 Montreal 2019-11-02 00:00:00 30 Halifax 2019-11-02 00:00:00 60 rows of string structured table
  • 36. Pandas Series 0 0.019111 1 -0.406886 2 -0.684147 3 0.713178 4 -0.573002 array([ 0.01911064, -0.40688553, -0.68414699, 0.71317766, -0.5730021) Index Values array([ 0, 1, 2, 3, 4) import pandas as pd from numpy.random import randn # create a pandas series with random numbers s1 = pd.Series(randn(5)) print(s1)
  • 37. Pandas DataFrame California 423967 Florida 170312 Illinois 149995 New York 141297 Texas 695662 California 38332521 Florida 19552860 Illinois 12882135 New York 19651127 Texas 26448193 # create the population series population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135} population = pd.Series(population_dict) # create the area series area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995} area = pd.Series(area_dict) # create the dataframe from a dict of series states = pd.DataFrame({'population': population, 'area': area}) statesIndex Values Index Values # create the dataframe from a dict of series states = pd.DataFrame({'population': , 'area': }) Series Series Row Index Population Area California 423967 38332521 Florida 170312 19552860 Illinois 149995 12882135 New York 141297 19651127 Texas 695662 26448193 Columns Indices Row Index Values
  • 38. Pandas DataFrame – Row/Column Row Index Population Area California 423967 38332521 Florida 170312 19552860 Illinois 149995 12882135 New York 141297 19651127 Texas 695662 26448193 Column DataFrame Values Column Row Row Row Row Row Row Index Row Index Row Index Row Index Row Index Column Index Column Index
  • 40. Tabular Data Manipulation Selection/Indexing City Datetime Ticket Sales Toronto 2019-11-01 00:00:00 100 Montreal 2019-11-01 00:00:00 50 Toronto 2019-11-02 00:00:00 20 Halifax 2019-11-01 00:00:00 40 Montreal 2019-11-02 00:00:00 30 Halifax 2019-11-02 00:00:00 60 City Ticket Sales Toronto 100 Montreal 50 Toronto 20 Halifax 40 Montreal 30 Halifax 60
  • 41. Tabular Data Manipulation Filtering City Datetime Ticket Sales Toronto 2019-11-01 00:00:00 100 Montreal 2019-11-01 00:00:00 50 Toronto 2019-11-02 00:00:00 20 Halifax 2019-11-01 00:00:00 40 Montreal 2019-11-02 00:00:00 30 Halifax 2019-11-02 00:00:00 60 City Date Ticket Sales Halifax 2019-11-01 00:00:00 40 Halifax 2019-11-02 00:00:00 60
  • 42. Tabular Data Manipulation Transformation Cityc Datetime Ticket Sales Toronto 2019-11-01 00:00:00 100 Montreal 2019-11-01 00:00:00 50 Toronto 2019-11-02 00:00:00 20 Halifax 2019-11-01 00:00:00 40 Montreal 2019-11-02 00:00:00 30 Halifax 2019-11-02 00:00:00 60 City Date Ticket Sales Toronto 2019-11-01 100 Montreal 2019-11-01 50 Toronto 2019-11-02 20 Halifax 2019-11-01 40 Montreal 2019-11-02 30 Halifax 2019-11-02 60 Weekday Monday Monday Tuesday Monday Tuesday Tuesday
  • 43. Tabular Data Manipulation Data Aggregation City Ticket Sales Toronto 100 Montreal 50 Toronto 20 Halifax 40 Montreal 30 Halifax 60 City Ticket Sales Toronto 100 Toronto 20 City Ticket Sales Montreal 50 Montreal 30 City Ticket Sales Halifax 40 Halifax 60 City Ticket Sales Toronto 60 City Ticket Sales Montreal 40 City Ticket Sales Halifax 50 City Ticket Sales Toronto 60 Montreal 40 Halifax 50 Input DataFrame Split DataFrameGroupBy Apply (sum) DataFrameGroupBy Combine DataFrame
  • 44. Tabular Data Manipulation GroupBy - aggregate(), agg() City Ticket Sales Toronto 100 Montreal 50 Toronto 20 Halifax 40 Montreal 30 Halifax 60 City Ticket Sales Toronto 100 Toronto 20 City Ticket Sales Montreal 50 Montreal 30 City Ticket Sales Halifax 40 Halifax 60 City Ticket Sales Toronto 60 City Ticket Sales Montreal 40 City Ticket Sales Halifax 50 City Ticket Sales Toronto 60 Montreal 40 Halifax 50 df.groupby('city').aggregate({'sales': 'mean'})
  • 45. A filtering operation allows you to drop data based on the group properties Tabular Data Manipulation GroupBy – filter() City Ticket Sales Toronto 100 Montreal 50 Toronto 20 Halifax 40 Montreal 30 Halifax 60 City Ticket Sales Toronto 100 Toronto 20 City Ticket Sales Montreal 50 Montreal 30 City Ticket Sales Halifax 40 Halifax 60 City Ticket Sales Avg() Toronto 100 60 Toronto 20 60 City Ticket Sales Avg() Montreal 50 40 Montreal 30 40 City Ticket Sales Avg() Halifax 40 50 Halifax 60 50 City Ticket Sales Toronto 100 Toronto 20 Halifax 40 Halifax 60 df.groupby('city').filter(lambda x: x['sales'].mean()>45)
  • 46. Tabular Data Manipulation GroupBy – transform() While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input City Ticket Sales Toronto 100 Montreal 50 Toronto 20 Halifax 40 Montreal 30 Halifax 60 City Ticket Sales Toronto 100 Toronto 20 City Ticket Sales Montreal 50 Montreal 30 City Ticket Sales Halifax 40 Halifax 60 City Ticket Sales Sum() Toronto 100 120 Toronto 20 120 City Ticket Sales Sum() Montreal 50 80 Montreal 30 80 City Ticket Sales Sum() Halifax 40 100 Halifax 60 100 City Ticket Sales Toronto 0.833 Toronto 0.167 Montreal 0.625 Montreal 0.375 Halifax 0.4 Halifax 0.6 df.groupby('city').transform(lambda x: x / x.sum())
  • 47. • INNER JOIN: Select records that have matching values in both tables. • LEFT OUTER JOIN: Select records from the first (left-most) table with matching right table records. • RIGHT OUTER JOIN: Select records from the second (right-most) table with matching left table records. • FULL OUTER JOIN: Selects all records that match either left or right table records. Types of Data Transformation Join/Merge
  • 48. Left Right Types of Data Transformation Inner Join
  • 49. Left Right Types of Data Transformation Left Join
  • 50. Left Right Types of Data Transformation Full Outer Join
  • 52. Learning Python The 3 stages of transitions
  • 53. 1. Start the Anaconda Navigator 2. Click the Spyder launch button 3. For this course we will be using the Jupyter Notebook Python IDE Anaconda
  • 54. Jupyter Notebook - Local Server/Laptop Browser Notebook Server Python R Julia Kernels Jupyter Notebook Local Server/Laptop
  • 56. Jupyter Notebook Start Jupyter from the Command Line
  • 57. notebook running on port 8892 notebook running on port 8892 Jupyter Notebook Ports
  • 59. Learn Python and Data Science with WeCloudData
  • 60. Data Science with Python Student Project Demo
  • 61. Data Science with Python Student Project Demo Analyzing the Top Travel Influencers on Instagram
  • 62. Data Science with Python Student Project Demo
  • 63. Open a webpage and get HTML of search results page Fishing Online Process data using pandas and save data into .csv file Locating page elements by XPath Extract Target Data NextURL? Yes No Data Science with Python Student Project Demo
  • 64. Data Science with Python Student Project Demo
  • 65. Learn Data Science Understand the big picture
  • 66. Data Science Fundamentals Python Programming WeCloudData
  • 67. Python Programming Why Python? • Python is the most popular data science and AI programming language • Many employers prefer candidates with python skills • Mastering python will expose you to not only Data Scientist jobs, but also Data Engineers and DevOps
  • 68. Python Programming Who is this course for? • The Python Programming course is designed for students and professionals who want to learn to code in Python before moving on to the data science and machine learning topics • Whether you’re considering a new career in data science or upskilling yourself to catch the latest data science technology trends, this course is designed for you • If you come from an Excel background, this course teaches you programming fundamentals and help you gain confidence in coding through implementing projects and use cases • If you come from SAS or R programming background, this course will help you comprehend the difference and advantage of Python to other analytics tools • This course is designed for students who don’t have prior experience with Python
  • 69. Python Programming Learning outcome After completing this course, you will be able to • Master python programming essentials for data science and analytics • Use Python to analyze social media data and visualize data using Pandas and Matplotlib • Understand various business use cases and most importantly how to apply best practices to real company data problems using python • Implement your first machine learning algorithm using python via a hands-on project • Gain real-world experience through hands-on labs and convince your employers that you’re up for new data analytics projects at work • Most importantly, you will be able to build your confidence in python coding and get prepared for learning machine learning and data science
  • 70. Python Programming Instructor – Jodie Zhu • Machine Learning Enthusiast at Dessa • University of Toronto, Master of Science (Biostatistics) • Python Instructor at WeCloudData • Career development mentor • Expertise: Python | Data Science | Deep Learning Machine Learning Engineer Dessa
  • 71. Python Programming Prerequisites Prerequisites • There are no requirements other than a working laptop and internet connection. • You do not need prior experience with Python or another programming language such as R, SAS, or Matlab, but it helps! Software Requirements • Anaconda Distribution • (Installation guidance will be shared after students register for the course. Our assistant instructor will make sure you have the right software installed for the course
  • 72. Python Programming Syllabus • Python use cases • Branching • Loops • Data Types: list, tuple, set • Functions • Lab: social media analytics with python Day 1 Python Basics Day 2 Intermediate Python Day 3 Python Data Analysis • Data Types: String • Data Types: Dictionary • Comprehensions • Regular expression • Modules & Packages • Class and Object • Interview – Prepare for Python Interview Tests • Lab – Class and object • Pandas introduction • Intro to visualization with python • Accessing database with python • Use case: Python for ML and AI • Project: Building your first ML algorithm with python • Python Installation • Jupyter Introduction • Python Introduction • DS Introduction • Twitter Dev API Setup Pre-course Installation & Preview Python learning resources, pdf, coding interview questions
  • 73. Python Programming Industry Use Cases In this course, we not only teach students how to program Python, but also real-world business use cases. Understanding the use cases and best practices will allow the students to confidently apply Python skills to practical data challenges Business Use Cases • Why is Python programming important for data science, AI, and Cloud? • Python use cases in Data Science • Python use cases in AI • Python use cases in the financial industry • Python use cases in sales analytics • Python use cases in social media analytics
  • 74. Python Fundamentals Learning Support Support you will receive during this course include • Mentorship and advice from industry experts • In-classroom learning assistance by our assistant instructor • Online learning support via Slack from instructor and TA • Hands-on labs and projects to help you apply what you learn • Additional resources to help you gain advanced knowledge • Help from our learning advisor on how to choose the learning path and specialization courses after the Python Programming course
  • 76. Python Fundamentals How to convince your employer Do you know that most employers will reimburse the training costs? • We have a detailed course syllabus and email template that you can use to convince your manager that this is the right course for you and a good investment for your company • You will have a completed project and presentation that you can use to demo to your manager and showcase your newly minted Python skills and get ready for more interesting data analytics projects
  • 77. Python Fundamentals FAQ Questions Answers For this course, do I need to know programming? No. Students don’t need to know Python or other programming languages before taking this course Do I need to complete hands-on projects? Yes. You will need to complete the project and pass a quiz before you’re qualified for continuing the data science courses How does this Python course compare to other comparable courses? It’s instructor-led, hands-on, and project-based. You learn to get started with Python programming with a fraction of the cost from our competitors What is the teaching philosophy? Use-case driven, hands-on and fun! How many hours a week do I expect to put in? Most labs are completed in the class. Expect to put in additional 3-4 hours a week to work on projects and assignments Do I get career support if I take a part-time course? No, career support is not included in any courses. We suggest students work with a 3rd party career service provider such as WeCareer.