Spark Machine Learning & Deep Learning in Practice
欧 锐
2018/10/31
Contents
Spark MLlib Principles
Spark MLlib in Practice
Spark-deep-learning in Practice
Spark MLlib Principles 01
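Finding a Husband with a Decision Tree
Put simply, decision-tree classification works much like matchmaking. Imagine a mother who wants to introduce a potential boyfriend to her daughter, which leads to the following conversation:
Daughter: How old is he? Mother: 26.
Daughter: Is he handsome? Mother: Quite handsome.
Daughter: Does he earn a lot? Mother: Not very much; about average.
Daughter: Is he a civil servant? Mother: Yes, he works at the tax bureau.
Daughter: All right, I'll meet him.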
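The conversation above is effectively a walk down a decision tree: each question is a test at a node, and each answer selects a branch. A minimal sketch in plain Scala follows. The `Suitor` case class, its field names, and every threshold (for example "at most 30 years old") are illustrative assumptions; they do not come from the slides or from the training data.

```scala
// Hypothetical record for one candidate; field names are illustrative only.
final case class Suitor(age: Int, looks: String, income: String, civilServant: Boolean)

object DatingTree {
  // Each `if` is one internal node of the tree; each Boolean result is a leaf.
  // All thresholds below are assumptions made for this sketch.
  def willMeet(s: Suitor): Boolean =
    if (s.age > 30) false               // question 1: age
    else if (s.looks == "ugly") false   // question 2: looks
    else if (s.income == "high") true   // question 3: income
    else if (s.income == "low") false
    else s.civilServant                 // question 4: medium income, so ask about the job
}
```

With these assumed rules, `DatingTree.willMeet(Suitor(26, "handsome", "medium", civilServant = true))` follows the same path as the dialogue and ends at the "meet him" leaf. A learned decision tree has the same shape, except that the questions and thresholds are chosen from the training data rather than written by hand.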
Training data
Code example: https://github.com/ouyangshourui/spark-mlib-training.git
• Machine Learning Overview
• Machine Learning with Spark MLlib & ML
What is Machine Learning?
1. Machine learning is a field within artificial intelligence (AI);
2. Machine learning algorithms “learn” from data and often produce
a predictive model as their output;
3. AI > Machine learning > Deep Learning
The 7 Steps of Machine Learning
1. Gathering data
2. Preparing that data
3. Choosing a model
4. Training
5. Evaluation
6. Hyperparameter tuning
7. Prediction.
Relationship of Algorithms and Data Volume
1. There are many algorithms for each type of machine learning:
• There’s no overall “best” algorithm;
• Each algorithm has advantages and limitations;
2. Algorithm choice is often related to data volume;
3. A simple algorithm with lots of data is often the best approach;
4. Spark is an excellent platform for machine learning over large data sets.
It’s not who has the best algorithms that wins. It’s who has the most data.
—Banko and Brill, 2001
Spark MLlib and Spark ML
1. Spark MLlib is Spark's machine learning library
• Makes practical machine learning scalable and easy
• Includes many common machine learning algorithms
• Includes base data types for efficient calculations at scale
• Supports scalable statistics and data transformations
2. Spark ML is a newer, higher-level API for machine learning pipelines, comparable to Python's scikit-learn
• Built on top of Spark’s DataFrames API
• Simple and clean interface for running series of complex tasks
• Supports most functionality included in Spark MLlib
Feature Engineering
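• Everyone agrees that feature engineering matters in machine learning, but what exactly is it, and how can it be explained simply? As an analogy: even with the best fishing gear, a pond with no fish leaves you empty-handed. Feature engineering is the work of finding the waters that hold fish. In other words, features are the information extracted from the data that is useful for predicting the result (the waters), and feature engineering uses domain knowledge to process the data and pick out the valuable features (choosing, out of 100 waters, the ones with the most and the best fish). Hence the saying: however powerful the algorithm, its upper bound is set by feature engineering, just as the size of your catch depends on the waters no matter how good your gear is.
In Spark ML, feature engineering operations fall into three groups: feature extraction, feature transformation, and feature selection.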
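Feature Engineering (TF-IDF)
1. TF-IDF stands for term frequency-inverse document frequency.
2. TF-IDF measures how important a word is to a document. TF is how often a word or phrase occurs in a document. IDF penalizes popular words: a very common word such as "China" receives a smaller weight, while a rarer word such as "kungfu" receives a larger weight.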
Because china appears in all three documents, its TF-IDF is 0.0. kungfu appears only in the first document (so it is a rare word) but is the most frequent word in that document, so its computed TF-IDF, 1.3862943611198906, is the highest.
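The two numbers above can be reproduced by hand. The sketch below is plain Scala without Spark; the three documents are made up for illustration (the slide's actual documents are not in the text), chosen only so that "china" occurs in every document and "kungfu" occurs twice in the first one. It assumes raw counts for TF and the smoothed IDF log((N + 1) / (df + 1)) documented for Spark MLlib, which is consistent with the values quoted above.

```scala
object TfIdfSketch {
  // Three tiny made-up documents, already tokenized (an assumption of this sketch).
  val docs: Seq[Seq[String]] = Seq(
    Seq("china", "kungfu", "kungfu", "is", "great"),
    Seq("china", "is", "big"),
    Seq("china", "tea")
  )

  // TF: raw count of the term in one document.
  def tf(term: String, doc: Seq[String]): Int = doc.count(_ == term)

  // Smoothed IDF: log((N + 1) / (df + 1)), where N is the number of documents
  // and df is the number of documents that contain the term.
  def idf(term: String): Double = {
    val df = docs.count(_.contains(term))
    math.log((docs.size + 1.0) / (df + 1.0))
  }

  def tfIdf(term: String, doc: Seq[String]): Double = tf(term, doc) * idf(term)
}
```

For "china", df equals N, so the IDF is log(4 / 4) = 0 and the product is 0.0 regardless of the count. For "kungfu" in the first document, the count is 2 and the IDF is log(4 / 2), giving 2 × ln 2 ≈ 1.386.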
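Feature Engineering (Bucketizer)
• Suppose there is a recommendation requirement, and the product manager finds that splitting people into "over 50" and "under 50" is too coarse; they should instead be grouped as under 20, 20-30, 30-40, 40-50, and over 50. This calls for discretizing a numeric value.
• Discretization means splitting a feature into suitable discrete ranges. Age, as above, is a continuous feature; dividing it into age groups discretizes it, which makes it easier to analyze user behavior and make precise recommendations.
• Bucketizer makes it easy to split a set of values into such ranges.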
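The age grouping described above can be sketched in plain Scala. This is not Spark's `Bucketizer` itself; it only imitates the documented convention that n + 1 split points define n half-open buckets [lower, upper), with the last bucket also including its upper bound. The object and method names are hypothetical, and the 40-50 boundary is an assumption about the intended age groups.

```scala
object AgeBuckets {
  // n + 1 split points define n buckets; infinities make the outer buckets open-ended.
  val splits: Array[Double] =
    Array(Double.NegativeInfinity, 20.0, 30.0, 40.0, 50.0, Double.PositiveInfinity)

  // Index of the bucket [splits(i), splits(i + 1)) that contains `age`.
  def bucket(age: Double): Int = {
    val i = splits.lastIndexWhere(_ <= age)
    math.min(i, splits.length - 2) // the top boundary belongs to the last bucket
  }
}
```

Here bucket 0 is "under 20", bucket 1 is "20-30", and so on up to bucket 4 for "50 and over"; a value that sits exactly on a boundary, such as 20, falls into the higher bucket.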
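Feature Engineering (Standardization and Normalization)
Standardization
Standardize each feature by its standard deviation, i.e. rescale it to mean 0 and variance 1. The shape of the distribution is unchanged, so the result is normally distributed only if the input was.
Standardized values fluctuate around 0: a value above 0 is above average, and a value below 0 is below average.
Normalization
Normalization rescales all feature values linearly into the range 0 to 1 or -1 to 1, so that all features are on the same scale.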
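The two rescalings can be written out directly. The following plain-Scala sketch is not Spark's scaler API; the names are hypothetical, it assumes the input values are not all equal (otherwise it divides by zero), and it uses the population standard deviation (divide by n), whereas library implementations may divide by n - 1 instead.

```scala
object Scaling {
  // Standardization (z-score): subtract the mean, divide by the standard deviation.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
    xs.map(x => (x - mean) / std)
  }

  // Min-max normalization: rescale linearly so the values span [0, 1].
  def minMax(xs: Seq[Double]): Seq[Double] = {
    val lo = xs.min
    val hi = xs.max
    xs.map(x => (x - lo) / (hi - lo))
  }
}
```

For example, `Scaling.minMax(Seq(10.0, 20.0, 30.0))` maps the values to 0.0, 0.5, and 1.0, while `standardize` maps any value equal to the mean to exactly 0.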
Feature Engineering
Still being improved; approaching the functionality of scikit-learn.
Spark MLlib Regularization
Spark provides three kinds of regularization for linear regression:
• L1:
• L2:
• Elastic net:
A combination of L1 and L2, known as elastic net. It balances feature selection (L1) and weight decay (L2).
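The formulas on this slide did not survive as text. As a hedged reconstruction, the sketch below computes the elastic net penalty in the form given in the Spark documentation, λ · (α · ‖w‖₁ + (1 − α)/2 · ‖w‖₂²), where λ and α correspond to the `regParam` and `elasticNetParam` settings of Spark's linear regression. The object and function names are hypothetical, and the code is plain Scala rather than Spark's implementation.

```scala
object ElasticNet {
  // penalty = lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  // alpha = 1 gives a pure L1 penalty, alpha = 0 a pure L2 penalty.
  def penalty(w: Seq[Double], lambda: Double, alpha: Double): Double = {
    val l1 = w.map(x => math.abs(x)).sum
    val l2Squared = w.map(x => x * x).sum
    lambda * (alpha * l1 + (1 - alpha) / 2 * l2Squared)
  }
}
```

For weights (3, -4) and λ = 0.1, the penalty is 0.7 with α = 1 (L1 only), 1.25 with α = 0 (L2 only), and 0.975 with α = 0.5, which shows how α blends the two terms.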
Spark ML
Machine learning tasks consist of a (potentially complex) series of steps
• Data transformations, algorithm training, and model prediction;
• These steps can be viewed as a pipeline through which the data travels;
• Transformers & Estimators
Spark ML-pipeline
A Pipeline represents a series of steps in a machine learning workflow:
• Each pipeline step can be either a transformer or an estimator
• A Pipeline takes a DataFrame as input and produces a PipelineModel as output
• A Pipeline is therefore itself an estimator
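Why a pipeline counts as an estimator is easier to see in a toy model. The sketch below is plain Scala and deliberately does not use Spark's classes: `Data`, the two traits, `Doubler`, and `MeanCenterer` are all hypothetical stand-ins, with a `Seq[Double]` playing the role of a DataFrame.

```scala
object ToyPipeline {
  type Data = Seq[Double] // stand-in for a DataFrame

  trait Transformer { def transform(data: Data): Data }
  trait Estimator { def fit(data: Data): Transformer }

  // A transformer: nothing to learn, it just maps data to data.
  object Doubler extends Transformer {
    def transform(data: Data): Data = data.map(_ * 2)
  }

  // An estimator: fit() learns the mean and returns a transformer that subtracts it.
  object MeanCenterer extends Estimator {
    def fit(data: Data): Transformer = {
      val mean = data.sum / data.size
      new Transformer { def transform(d: Data): Data = d.map(_ - mean) }
    }
  }

  // The pipeline has a fit() that returns a transformer, so it is itself an estimator.
  // Stages run in order; each estimator is fitted on the data produced so far.
  final class Pipeline(stages: Seq[Either[Transformer, Estimator]]) extends Estimator {
    def fit(data: Data): Transformer = {
      var current = data
      val fitted = stages.map { stage =>
        val t = stage match {
          case Left(transformer) => transformer
          case Right(estimator)  => estimator.fit(current)
        }
        current = t.transform(current)
        t
      }
      new Transformer {
        def transform(d: Data): Data = fitted.foldLeft(d)((acc, t) => t.transform(acc))
      }
    }
  }
}
```

Fitting `new ToyPipeline.Pipeline(Seq(Left(ToyPipeline.Doubler), Right(ToyPipeline.MeanCenterer)))` on some data returns a single transformer that applies both fitted stages, which mirrors the roles of Pipeline and PipelineModel described above.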
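Spark MLlib in Practice 02
Example: credit risk in bank lending
What do we want to predict?
• Whether a person will repay a loan on time
• This is the label: the person's creditworthiness
What are the yes/no questions or attributes used to make the prediction?
• The applicant's basic and social-status information: occupation, age, savings, marital status, and so on
• These are the features used to build a classification model; you extract from them the information that helps classification.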
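Random forest model
Random forests are a popular ensemble learning method for classification and regression problems. The algorithm builds multiple decision trees on different subsets of the training data and combines them into a single model.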
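One simple way to combine the trees for classification is a majority vote, sketched below in plain Scala. The `Tree` type and `predict` function are hypothetical, each tree is reduced to a function from features to a class label, and ties are broken arbitrarily; a library implementation may instead weight votes by each tree's class probabilities.

```scala
object ForestVote {
  // A "tree" here is any function from a feature vector to a class label.
  type Tree = Seq[Double] => Int

  // The forest predicts the label that the largest number of trees vote for.
  def predict(trees: Seq[Tree], features: Seq[Double]): Int =
    trees.map(tree => tree(features))
      .groupBy(identity)
      .maxBy { case (_, votes) => votes.size }
      ._1
}
```

With three trees voting 1, 1, and 0 for the same input, the forest answers 1; a single tree that disagrees with the others is simply outvoted, which is why the combined model tends to be more stable than any one tree.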
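German credit dataset
{"creditability", "balance", "duration", "history", "purpose", "amount", "savings", "employment", "marital status", "guarantors", "residence duration", "assets", "age", "credit history", "apartment", "credits", "occupation", "dependents", "has phone", "foreign"}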
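Schema definition & data initialization
// parseRDD and parseCredit are helper functions that are not shown on this slide
import sqlContext.implicits._ // needed for toDF(); assumes the sqlContext used below
val rdd = sc.textFile("data/germancredit.csv")
val creditDF = parseRDD(rdd).map(parseCredit).toDF().cache()
creditDF.createOrReplaceTempView("credit")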
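Feature engineering
Once the DataFrame is initialized, you can query the data with SQL. Below are some examples that query the data through the Scala DataFrame API:
Compute summary statistics for numeric data, including count, mean, standard deviation, minimum, and maximum.
// Average balance, amount, and duration for each creditability class
sqlContext.sql("SELECT creditability, avg(balance) AS avgbalance, avg(amount) " +
  "AS avgamt, avg(duration) AS avgdur FROM credit GROUP BY creditability").show()
// Count, mean, standard deviation, min, and max of balance
creditDF.describe("balance").show()
// Average balance per creditability class via the DataFrame API
creditDF.groupBy("creditability").avg("balance").show()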
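The features are then transformed and stored in a feature vector, i.e. a numeric vector holding the value of each feature dimension. VectorAssembler combines the listed feature columns into a single vector column and returns a new DataFrame.
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
val featureCols = Array("balance", "duration", "history", "purpose", "amount",
  "savings", "employment", "instPercent", "sexMarried", "guarantors",
  "residenceDuration", "assets", "age", "concCredit", "apartment",
  "credits", "occupation", "dependents", "hasPhone", "foreign")
// Assemble the feature columns into one "features" vector column
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val df2 = assembler.transform(creditDF)
println("featureCols VectorAssembler:")
df2.show()
// Index the creditability column to produce the "label" column
val labelIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
df3.show()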
The dataset is split into training data and test data: 70% of the data is used to train the model and 30% to test it.
val splitSeed = 5043
val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)
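• maxDepth: the maximum depth of each tree. Deeper trees can improve the model's fit but lengthen training time.
• maxBins: the maximum number of bins used when discretizing continuous features; it also determines how each node can be split.
• impurity: the criterion used to compute information gain.
• featureSubsetStrategy = "auto": automatically choose the number of features considered at each node split.
• seed: the random number generator seed.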
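To make the impurity parameter concrete, here is a plain-Scala sketch of the two usual criteria for classification, Gini impurity and entropy, computed from the class counts at a node. The object and function names are hypothetical and this is not Spark's internal code.

```scala
object Impurity {
  // Gini impurity: 1 - sum(p_i^2), where p_i is the share of class i at the node.
  def gini(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    1.0 - counts.map(c => math.pow(c / total, 2)).sum
  }

  // Entropy in bits: -sum(p_i * log2(p_i)), skipping classes with no samples.
  def entropy(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }
}
```

A pure node (all samples in one class) scores 0 under both measures, and an evenly split two-class node scores the maximum: 0.5 for Gini and 1 bit for entropy. A split is chosen to reduce this impurity as much as possible, which is the information gain the parameter refers to.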
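Model training
Next, we train a random forest classifier with the following parameters:
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val classifier = new RandomForestClassifier()
  .setImpurity("gini")
  .setMaxDepth(3)
  .setNumTrees(20)
  .setFeatureSubsetStrategy("auto")
  .setSeed(5043)
val model = classifier.fit(trainingData)
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
val predictions = model.transform(testData)
// Print the structure of the trained trees
println(model.toDebugString)
Evaluating the training results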
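We then evaluate the predictions with BinaryClassificationEvaluator, which compares the predictions against the true labels and returns a metric, by default the area under the ROC curve (AUC). In this example the AUC reaches 81%.
// The evaluator's default metric is areaUnderROC, so this value is an AUC, not an accuracy
val auc = evaluator.evaluate(predictions)
println("AUC before pipeline fitting: " + auc)
import org.apache.spark.mllib.evaluation.RegressionMetrics
// Regression-style error metrics over (prediction, label) pairs
val rm = new RegressionMetrics(
  predictions.select("prediction", "label").rdd.map(x =>
    (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
)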
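As an aid to reading the AUC figure, the sketch below computes the area under the ROC curve from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. It is plain Scala, the names are hypothetical, it assumes labels are 0 or 1 with at least one example of each, and it is not how Spark computes the metric internally.

```scala
object AucSketch {
  // Each element is (score, label) with label 1 = positive and 0 = negative.
  def auc(scored: Seq[(Double, Int)]): Double = {
    val pos = scored.collect { case (s, 1) => s }
    val neg = scored.collect { case (s, 0) => s }
    // Compare every positive score with every negative score.
    val wins = for { p <- pos; n <- neg } yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size * neg.size)
  }
}
```

With scores 0.9 and 0.3 for the positives and 0.8 and 0.1 for the negatives, three of the four pairs are ordered correctly, so the AUC is 0.75. A value of 0.5 corresponds to random ordering and 1.0 to perfect separation.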
boston-house-prices
Spark-deep-learning in Practice 03
Options for combining Spark and TensorFlow
1. Elephas: Distributed DL with Keras & PySpark
2. Yahoo! Inc.: TensorFlowOnSpark
3. CERN Distributed Keras (Keras + Spark)
4. Qubole (tutorial Keras + Spark)
5. Intel Corporation: BigDL (Distributed Deep Learning Library for Apache Spark)
6. Deep Learning Pipelines
7. MLflow
Spark-deep-learning architecture
images of two persons
Using spark-deep-learning to distinguish Steve Jobs, Mark Zuckerberg, and my baby
Data preprocessing
images of three persons
Model training
Model evaluation
Training parameter tuning
Image prediction
Code: https://github.com/ouyangshourui/spark-deep-learning-example
Recommended reading
Q&A
