Big Data for Data Scientists - Info Session

Big Data for Data Scientists
Trends and Use Cases
WeCloudData
@WeCloudData @WeCloudData tordatascience
weclouddata
WeCloudData tordatascience

Career
Services
Meetup
Events
Introduction
Data Skills
Training
WeCloudData offers Toronto’s first data
science accelerator program. We specialize
in teaching lead-edge tools such as AWS,
Spark, and Machine Learning and help our
corporate clients upskill/reskill their data
teams

WCD works with some of the most
talented and experienced data
science experts to deliver public
and corporate trainings. We
currently have 21 part-time and 2
full-time instructors.
Our instructors bring their analytical
expertise from various industries,
teach students advanced tools
such as Python, Hadoop, Spark,
and AWS, mentor students on end-
to-end data projects.
Introduction
Faculty Team
21
Instructors
10
Teaching
Assistants

Python for SAS
and SQL Users
Machine
Learning |
Deep Learning
Big Data
Executive
Workshops
Product & Services
Corporate Training
We offer customized corporate training to Canadian
companies with flexible schedules and learning
support!
We help train, upskill, and
reskill data teams!

Python for SAS Users
Machine Learning
Big Data
AI/DS for Executives
Corporate Data Programs
We’ve delivered customized trainings to many large Canadian companies
WeCloudData
Corporate
Program
We offer customized corporate training to Canadian
companies with flexible schedules and learning
support!
We help train, upskill,
and reskill data
teams!

Introduction
Communities we’re building
8,000 members
120 events
We organize one of the most active DS
communities in Canada!

Workshop Provider
Conference/Clients
Workshop Provider
TMLS Conference
November, 2018
Workshop Provider
TD Canada
Analytics Month
October, 2018
• Machine Learning Open Data
• Spark ML and MLflow
• Deep Learning with PyTorch
• Python for SAS Users
• Machine Learning with Python
Workshop Provider
Big Data & AI
Toronto 2019
June, 2019
• Big Data in AWS Cloud
• Spark for Data Science
• Moving from On-Prem to Cloud
WeCloudData is the conference workshop choice of vendors in Toronto due to our expertise and
specialty.

Analytics Events
We help companies with hiring/branding events
WeCloudData organizes one of the
largest and most active data science
communities in Toronto with 7,500
members and 110 past events. We
help companies facilitate mini-
conferences and help them run hiring
events.

2005
2007
2008 2010
2011
2015
2012
2014 2016 2018
Instructor
Shaohua Zhang
• Co-founder and CEO of WeCloudData. Lead instructor for the corporate training program
• Certified SAS Predictive Modeler since 2007 (among the first 20 in the world)
• Helped build and lead the data science team at BlackBerry (2010 – 2015)
• Helping Communitech incubator and Open Data Exchange mentor startups on data strategies
• Specializes in machine learning, big data, and cloud computing

Learning Path
Data Science Program
Prerequisites
Data Science
Learning Path
Learn to build ML
models using
Sklearn
ML Applied
Master data
wrangling with
Python
Data Science
w/ Python
Harness big data
with Hadoop, Hive,
Presto, and
AtScale
Big Data
Build your portfolio
with hands-on
Capstone projects
ML Advanced
Machine Learning
at Scale with
PySpark ML and
Real-time
Deployment
Spark
Contact us about the courses:
• info@weclouddata.com
Upcoming courses:
• https://weclouddata.com/upcoming-course-schedule

Linux/Docker
Scala
Spark
Programming for
Data Engineering
Hadoop/Hive
Data Ingestion
Workflow
NoSQL
ETL (Big Data)
Spark Internals
Spark Tunings
Spark In-Depth
Kafka
Spark Streaming
Apache Flink
Realtime Analytics
Scaling ML
Model Deployment
Pipeline Automation
Machine Learning
Engineering
Learn to build data pipelines, scale
data processing with big data tools,
and deployment real-time
applications and machine learning
models at scale.
Data Engineering
Learning Path
Learning Path
Data Engineering Program
Contact us about the courses:
• info@weclouddata.com
Upcoming courses:
• https://weclouddata.com/upcoming-course-schedule

Data Jobs in the
MarketData Handling Complex Analytics Big Data Storytelling
Data Science
Data Scientist

Coding/Tools
Math/ML Storytelling
Data
Scientist
Linux
Python/Scala/Java
Cloud (AWS)
Hadoop, Spark
Statistics
Linear Algebra
Regression
Classification
Clustering
NLP
Presentation
Use cases
Project Mgmt
Communications
Data Science
Essential Skills

Data Scientist Data Analyst
Data Science
Job requirements

Data
Application
Scraping/API
Labeled data
Infra/
Platform
RDBMS
Hadoop
Cloud
Data Engineering
ETL
Enrichment
Dataflow
automation
AI/ML
Python
ML
Deployment
Prediction API
Stream
processing
Data Science
The Myth

Data Scientist
The Types
Operational DS
Focus: data wrangling, work with
large/small messy data, builds
predictive models
Strength: data handling, tools, business
knowledge
ML Engineer
Focus: ML model deployment, data
pipelines
Strength: coding, algorithms, machine
learning, platforms and tools
ML Researcher
Focus: algorithm development,
research, IP
Strength: ML/DL algorithms,
implmentation, research
DS Product Mngr
Focus: product strategy, business
communications, project management
Strength: product sense, business
requirements, DS acumen

Data Scientists are like unicorns… so they’re hard to find. Let’s
focus on building the data science teams.. that have data scientists,
engineers, and analysts working towards the same goal.
Data Science Team

2008 2010 2015 2016 2018
Predictive
Modeler
Grad
School
Data
Scientist
Data
Scientist
Instructor
DS
Trainer
Mentor
My DS Journey
Shaohua Zhang
Operational
Data Scientist
Product
Manager
Data/ML
Engineer
Tools
Projects
Churn
Up-sell/Cross-sell
Social Network
Recommender
Big Data
Cloud
Chatbot
Deployment
HR | Retail | Digital Analytics
Predictive Maintenance

Predictive
Modeler
GrowthAcquisition Maturity Decline Loss
● Lead Gen
● Digital Mktg
● Mobile Ads
● Cross/Up-sell
● Segmentation
● CLTV
● Taste graph
● Personalization
● Loyalty Management
● Context-based Mkgt
● Churn models
● Retention
Acquisition
Models
LTV Loyalty
Management
Retention Winback
Customer
Value
● Winback
models
Predict high risk customers

Twitter API
Data
Scientist
Business
Our new product feature received a lot of negative review..
- Can we do some analysis?

Data
Scientist
Business
Our new product feature received a lot of negative review..
- Can we do some analysis?
The analysis looks good. Can we build a small tool?

Credit
Approval
Age Gender
Annual
Salary
Months in
Residence
Months
in Job
Current
Debt
Paid off
Credit
Client 1 23 M $30,000 36 12 $5,000 Yes
Client 2 30 F $45,000 12 12 $1,000 Yes
Client 3 19 M $15,000 3 1 $10,000 No
Client 4 25 M $25,000 12 27 $15,000 ?
Data Preparation

Relational
Database
NewSQLNoSQL
NoSQL
Goolge F1NoSQL
GraphDB
Search
Cache
Databases

Credit: https://arxiv.org/pdf/1409.3809.pdf
GET /velox/catify/predict?userid=22&song=277568
GET /velox/catify/predict_top_k?userid=22&k=100
Velox
Prediction
Service
Model
Manager
Web
Application
HDFS
The Missing Piece
ML Prediction API

Big Data – 4 V’s
Paris
$1000mVolume
London
$1000mVelocity
Tokyo
$1000mVariety
New York
$1000mValue
“More data cross the
internet every second
than were stored in the
entire internet just 20
years ago” - Big Data: The
Management Review
(HBR)
Internet
• 2.3 Zetabytes/day
(2014)
Facebook
• 500 TB/day
(2012)
Programmatic Ads
• 200ms
Fraud Detection
• 400ms
Fraud Prevention
• 50ms
Structured
• Relational
Unstructured
• Image / Voice / Text
Semi-structured
• Graph
“Regardless of its size,
data is worthless if not
turned into actionable
insight”

Internet o 2.5 exabytes (2.5x1018) per
day – 2012
o 2.3 zettabytes (2.3x1021)
per day - 2014
Facebook o 500+ terabytes per day
o 100+ petabytes in a single
Hadoop cluster
“More data cross the internet every second
than were stored in the entire internet just
20 years ago” - Big Data: The Management
Review (HBR)
VelocityVolume Variety
Big Data - Volume

VelocityVolume Variety
video
Big Data – Velocity
Demo

¨ Data Variety
¤ Structured
n Table
n Relational
¤ Unstructured
n Text
n Image
n Audio/Video
¤ Semi-structured
n XML
n JSON
n Graph
Big Data – Variety

History of Big Data (Hadoop)
Hadoop
Big Data
Map Reduce
Apache Spark
Big data - Google Trends
Google MapReduce Paper
Doug Cutting got hired by Yahoo! to work on Hadoop
Spark took off

Knowing more
tools is always
helpful.
Knowing how to
put them to work
together is more
important!

Single Node Architecture
• Traditionally, computation has
been CPU bound
• Complex computation on small
data
• For decades, the primary push is
to increase the computing power
of a single machine

Scale Up vs. Scale Out
• Single Node Architecture
• Scaling up advantage
• Programming is easier than distributed computing
• Faster processing on smaller data
• Scale up disadvantage
• Hardware cost
• Scalability
• Advantage of scale-out systems
• Scalability
• Cost

Traditional Distributed Systems: Problems
• Modern large scale processing is distributed across
machines
• Often hundreds or thousands of nodes
• Focuses on distributing the processing workload
• Powerful compute nodes
• Separate systems for data storage
• Fast network connections to connect them
• Problems with these distributed systems:
• Complex programming model
• It is difficult to deal with partial failures of the system
• Bandwidth limitations
• Data consistency
• Typically at compute time, data is copied to the compute
nodes
• This doesn’t scale to today’s big data
problems!

Data Becomes the Bottleneck
• Traditional distributed systems
don’t scale to today’s Internet-
scale data
• Getting data to the computer
processor becomes the
bottleneck
• Disk I/O is slow
• Network bandwidth is bottleneck
• Solution à moving
computation to the data!
Internet o 2.5 exabytes (2.5x1018)
per day – 2012
o 2.3 zettabytes (2.3x1021)
per day - 2014
Facebook o 500+ terabytes per day
o 100+ petabytes in a
single Hadoop cluster

Modern Distributed Computing Cluster
• Cluster architecture
• A medium-to -large Hadoop
cluster consists of a two-level or
three-level architecture built
with rack-mounted servers.
Each rack of servers is
interconnected using a 1
Gigabyte Ethernet switch. Each
rack-level switch is connected
to a cluster-level switch (which
is typically a larger port-density
10GbE switch)
Stunning Photos Of Google's Massive Data Centers: http://www.forbes.com/pictures/edej45emjgl/up-above-the-massive-floor/

split
node1 node2 node4node3
Block 1 Block 3Block 2
HDFS
Hadoop Distributed File System

• The blocks are replicated to nodes throughout the cluster
• Based on the replication factor (3 by default)
• Replication increases reliability and performance
• Reliability: can tolerate data loss
• Performance: more opportunities for data locality
HDFS - Replications
split
DN1 DN2 DN4DN3
Block 1 Block 3Block 2

• The NameNode stores all metadata
• Information about file locations in
HDFS
• Information about file ownership and
permissions
• Name of the individual blocks
• Locations of the blocks
• Metadata is stored on disk and read
into memory when the NameNode
daemon starts up
• Changes/Edits to the files are written to
the logs
The Name Node
file à /user/lab/myFile.txt
replication à 3
blocksà red,green,blue
block locations à …
Name Node
DN1 DN2 DN4DN3

I wish to wish the
wish you wish to
wish, but if you
wish the wish the
witch wishes, I
won’t wish the
wish you wish to
wish
I wish to wish
the wish you
wish to
wish, but if you
wish the wish
the
witch wishes, I
won’t wish the
wish you wish
to wish
1
1 11 1
1 1
1
1
I
wish
to
the
you
1 11
1
1
1
wish
but
if
you
the 1 1
1
1
1
1
witch
wishes
I
won’t
wish 1 1
the
you
to
1 1
1
1
1
1
4
2
1
1
I
wish
to
the
you
3
1
1
1
wish
but
if
you
the 2
1
1
1
1
witch
wishes
I
won’t
wish 4
the
you
to
1
1
1
but
I
if
to
the
witch
wishes
won’t
wish
you
1
2
1
3
4
1
1
1
11
3
1
1 1
but
I
if
2
1
to
the
1
1
1
witch
wishes
won’t
wish 4
you 1
1
2 1
3 4
1 1
1
Documents
Splitting Map Shuffle/SortCombine Reduce
MapReduce handles these
automatically for you!!
MapReduce - WordCount

Slave
Slave
Slave
Hadoop
Hive
HDFS
hive > create table tweets_filter as
> select * from tweets
> where to_date(ts) in (‘2010-03-02’,
‘2010-0303’)
Hive Driver
Interpret the query
Optimize the computation
Create job plan and send to Hadoop
Hive CLI
TT 1
MySQL
Metast
ore
Master
Job2398564
Apache Hive
Map
JobTracker
NameNode
TT 2
TT 3

Apache Presto
Advantage
Daily/Hourly Batch Jobs Interactive Queries
Daily/Hourly Batch Jobs
Interactive Queries

Apache Presto
Advantage
Daily/Hourly Batch Jobs
Interactive Queries
SQL on any datasets

10x – 100x
MapReduce vs. Spark

Multi-core CPUs
RAM
Hard Drive SSD
Nodes in a different rack
Network
1Gb/s or
125 MB/s
100 MB/s
600 MB/s
10GB/s
0.1Gb/s
RAM vs. Disk vs. Network

• A unified platform that supports many data processing needs
including
• Batch processing (Spark)
• Stream processing (Spark Streaming)
• Interactive (SparkSQL)
• Iterative (MLlib, ML, GraphX, GraphFrame)
Spark - Unified Data Platform
O
ne
size
fits
m
any!

Visualization
Advanced Analytics
Data Processing
Database
Data Scientist Toolbox (Big Data)
Enterprise - Traditional

Visualization
Advanced Analytics
Data Processing
Platform
Enterprise – New/Cloud

Visualization
Advanced Analytics
Data Processing
Data Lake
Startups | Tech | Digital Labs | Big Data Teams

Visualization
Advanced Analytics
Data Processing
Data Lake
Enterprise – New Trend

Course Detail

About this course
• For learners who want to get started with big data, the sheer number of tools in the
ecosystem always feels overwhelming and confusing. With a well-structured
curriculum and instructors who have years of industry experience implementing big
data solutions, the Big Data for Data Scientist will help you focus on learning the tools
that matter the most.
• This course covers several popular big data platforms and frameworks that modern
data scientists and analysts need to master. Students learn throughout the course to
integrate different tools such as Hadoop, Hive, Presto, AWS, and NoSQL to solve real-
world data challenges.
• The course is built around an end-to-end big data pipeline to process terabyte scale
data (billions of records) in a cloud environment. Students gain first-hand experience
on data collection, ingestion, distributed storage, distributed processing, and
interactive visualizations.
• Many big data use cases will be covered to help consolidate the learnings and most
importantly students gain real-life experience and confidence to apply the knowledge
learned back to their data science projects at work.

Who is this course for?
• This course serves as a great foundational course for professionals who want to
switch career, graduates who want to get into this field as a data scientist, and big
data enthusiasts who want to learn the hottest big data tools such as Hadoop, Hive,
Presto, AWS, and NoSQL and apply them to solve real-world big data problems.
• For new graduates and job seekers, this course teaches you the essential big data
tools and concepts required for modern data scientist jobs and then complementary
big data interview questions will get your prepared for interview challenges.
• For data scientists who want to gain new skills, the course will give you
comprehensive view of the big data ecosystem and prepare you for the big data tasks
at work.
• For tech-savvy project managers who want to gain a comprehensive understanding
of big data use cases and lifecycles, the hands-on project in this course gives you
exactly what you hope for.

Learning outcome
After this course, the students will be able to
• Gain competence to take on real data challenges at workplace and demonstrate
experience and advantage in the job market with the learned skills added to the resume
• Gaining solid understanding of the Big Data ecosystem and various real-world use cases
• Comfortable working with different big data platforms such as Hortonworks and AWS
EMR, run Hive ETL pipelines and querying large datasets with Apache Presto
• Build and automate data pipelines with Apache Airflow and build a project demo via
visualization dashboard with Superset
• Gain real world experience through a hands-on project and convince your
manager/peers that you’re up for big data related projects at work

2005
2007
2008 2010
2011
2015
2012
2014 2016 2018
Instructor – Shaohua Zhang
• Co-founder and CEO of WeCloudData. Lead instructor for the Big Data course and the
corporate training program
• Helped build and lead the data science team at BlackBerry (2010 – 2015)
• Helping Communitech incubator and Open Data Exchange mentor startups on data strategies
• Specialize in machine learning, big data, and cloud computing

Prerequisites
Prerequisites
• You do not need prior experience with programming languages such as python, but it
helps!
• Familiarity with Linux Commands, SQL and relational database concepts
• Having an understanding of your company’s big data use case, technologies, and goals
will motivate and direct your focus in this course

Lecture Content Lecture Content
1
Big Data
• Introduction to Big Data
• Big Data Use Cases
• AWS – EC2/S3
7
Spark Core
• Introduction to Spark Core
• Spark RDD Operations
2
Hadoop
• Hadoop Data Distributed Filesystem
• MapReduce with Python
• AWS - EMR
8
Spark DataFrame |
SQL
• Spark DataFrame and SQL
• Complex Transformations and UDFs
3
Apache Hive |
Sqoop
• Hive Introduction
• Hive Queries
• Apache Sqoop
• Project kick-off
9
Spark Performance
Tuning
• Spark Internals
• Performance Tunings
4
SQL on Hadoop
• Presto/Impala
• Apache Kylin/AtScale
10
Spark ML
• Spark Machine Learning API
• Building Classification and Regression Models
5
NoSQL
• Amazon DynamoDB
• Cassandra
• Elasticsearch
11
Spark ML II
• Recommender System with Spark
• Deep Learning on Spark
6
Data Pipeline
• Data pipeline with Airflow
• Visualization with Superset
• Project Discussion
12
Spark Streaming
• Kafka/Kinesis
• Spark Streaming
• Project Presentation
Syllabus
Syllabus (Weekend Cohort – 12 sessions/48 hours)

Industry Use Cases
In this course, we not only teach students how to use the big data tools, but also common
use cases. Understanding the real-world use cases and industry best practices will allow
the students to apply skills to their company’s data problems
Use Cases
• Big data use cases in retail personalization
• Big data use cases for retail banking
• Big data use cases for fraud analytics
• Big data use cases in compliance analytics
• Big data use cases in online advertising

Hands-on Project
This course is instructor-led and project-based. Students will be able to apply the big data
knowledge acquired during the lectures build an end-to-end big data project.
Project: Building an AWS-based Big Data Pipeline
• Real-time data collection and ingestion via Kinesis and NoSQL
• Build Hive databases and ETL pipelines
• Interactive data analysis with Presto
• Building streaming MOLAP cubes with Apache Kylin
• Real-time dashboard with Apache Superset
• Workflow automation with Apache Airflow
Data Size: 500GB ~ 1TG
Records: 1 billion +
Twitter API
Kinesis

Student Project Demo
Stock price prediction using twitter sentiment and deep learning

Student Project Demo
Real-time Twitter Sentiment Pipeline

Learning Support
Support you will receive during this course include
• Mentorship and advice from an industry expert
• In-classroom learning assistance by our assistant instructor
• Online learning support on Slack from instructor and TA
• Hands-on labs and projects to help you apply what you learn
• Additional resources to help you gain advanced knowledge
• Help from our learning advisor on how to choose the learning path and
specialization courses after the Big Data course

Testimonials
This course really helped me with in-depth explanation and application of Cloud and Big Data technologies. The lead instructor is
very enthusiastic and gifted with years of industry experience as a chief data scientist. The course has a well-designed with
systematic curriculum structure where you get to learn each component of the Big Data Ecosystem with a big picture of the
whole Machine-Learning pipeline (online and offline).
Jason Lee
Student Testimonial
I took the Big Data course with WeCloudData. The course introduces the latest big data tools and platforms such as Apache
Hadoop and Amazon Web Services, as well as real-world use cases and industrial best practices. The course also includes an end-
to-end group project which will definitely be something you can be proud of.
I chose this course basically because my company uses Apache Spark and Hadoop distributed system, and I would like to learn
more about it. Surprisingly, what I learned from this course has been far beyond my expectation! I wish I knew WeCloudData
earlier so that I wouldn't have been that struggled at work.
I would also like to express my gratitude and appreciation to the instructor Shaohua in this course. He is extraordinarily
knowledgeable and experienced, one of the best instructor I have ever seen! The way he approaches to a theory is really
straightforward and easy to understand. He is nice and patient while answering questions as well, and always makes sure every
student is on the right track. The program managers of WeCloudData are kind and amiable too. It was a great pleasure to talk
with them!
Grace Tian

How to convince your employer
Do you know that most employers will reimburse the training costs?
• We have a detailed course syllabus and email template that you can use to convince
your manager that this is the right course for you and a good investment for your
company
• You will have a completed project and presentation that you can use to demo to
your manager and showcase your newly minted Big Data skills and get ready for
more interesting data analytics projects

Price
Course Pricing
Big Data & Spark for DS $2000 + tax

Upcoming WeCloud Events
Event Schedules

Upcoming Events
Schedule
Track
Meetup
Org
Topic Date
Big Data WeCloudData Big Data for Data Scientist – Open Class Jun 4
Big Data WeCloudData Spark on Kubernetes Jun 5
Big Data Lightbend
Running Kafka on Kubernetes with
Strimzi
Jun 11
Cloud Big Data & AI Conference
Machine Learning from Experimentation
to Production on AWS
Jun 12
Big Data Big Data & AI Conference
Transforming big data from On-premise
to the Cloud
Jun 12
Big Data Big Data & AI Conference Spark for Data Science Jun 13
Data Science Big Data & AI Conference Moving Towards a Python Environment Jun 13
Big Data | Data Science WeCloudData
Machine Learning Deployment with
Spark and Amazon Sage Maker
Jun 16
Big Data | Data Science WeCloudData Apache Spark Hands-on Workshop Jun 18
tordatascience
For details, visit https://www.meetup.com/tordatascience/

Big Data for Data Scientists - Info Session

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Big Data for Data Scientists - Info Session

Similar to Big Data for Data Scientists - Info Session (20)

More from WeCloudData

More from WeCloudData (11)

Recently uploaded

Recently uploaded (20)

Big Data for Data Scientists - Info Session