Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15MLconf
Spark DataFrames and ML Pipelines: In this talk, we will discuss two recent efforts in Spark to scale up data science: distributed DataFrames and Machine Learning Pipelines. These components allow users to manipulate distributed datasets and handle complex ML workflows, using intuitive APIs in Python, Java, and Scala (and R in development).
Data frames in R and Python have become standards for data science, yet they do not work well with Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.
On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.
From http://www.csdn.net/article/2015-12-17/2826501
《小米金融技术主管方流: 大数据在互联网金融中的应用》
方流在主题演讲中重点介绍了DW建设的业务架构及开发工具,包括log利器Scribe、ETL利器之Hadoop/Hdfs、DW利器之HBase、数据分析利器Hive/Sentry、OLAP利器Impala、数据迁移利器之sqoop、机器学习利器之spark。同时重点分析了用户金融画像并针对大数据反欺诈,给出了自己的探索实践,防止盗号,提供异常环境监测/手机验证;防止身份伪造,采用实名认证;鉴定虚假资料,进行交叉验证。
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15MLconf
Spark DataFrames and ML Pipelines: In this talk, we will discuss two recent efforts in Spark to scale up data science: distributed DataFrames and Machine Learning Pipelines. These components allow users to manipulate distributed datasets and handle complex ML workflows, using intuitive APIs in Python, Java, and Scala (and R in development).
Data frames in R and Python have become standards for data science, yet they do not work well with Big Data. Inspired by R and Pandas, Spark DataFrames provide concise, powerful interfaces for structured data manipulation. DataFrames support rich data types, a variety of data sources and storage systems, and state-of-the-art optimization via the Spark SQL Catalyst optimizer.
On top of DataFrames, we have built a new ML Pipeline API. ML workflows often involve a complex sequence of processing and learning stages, including data cleaning, feature extraction and transformation, training, and hyperparameter tuning. With most current tools for ML, it is difficult to set up practical pipelines. Inspired by scikit-learn, we built simple APIs to help users quickly assemble and tune practical ML pipelines.
From http://www.csdn.net/article/2015-12-17/2826501
《小米金融技术主管方流: 大数据在互联网金融中的应用》
方流在主题演讲中重点介绍了DW建设的业务架构及开发工具,包括log利器Scribe、ETL利器之Hadoop/Hdfs、DW利器之HBase、数据分析利器Hive/Sentry、OLAP利器Impala、数据迁移利器之sqoop、机器学习利器之spark。同时重点分析了用户金融画像并针对大数据反欺诈,给出了自己的探索实践,防止盗号,提供异常环境监测/手机验证;防止身份伪造,采用实名认证;鉴定虚假资料,进行交叉验证。
Presented at the MLConf in Seattle, this presentation offers a quick introduction to Apache Spark, followed by an overview of two novel features for data science
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015Modern Data Stack France
Record Linkage, un cas d’utilisation en Spark ML par Alexis Seigneurin
Le Record Linkage est le process qui consiste à trouver, dans un data set, les enregistrements qui représentent la même entité. Cette opération est particulièrement compliquée quand, comme nous, vous travaillez avec des données anonymisées. C’est là que le Machine Learning vient en renfort ! Nous avons implémenté un algorithme de Record Linkage en Spark SQL (DataFrames) et Spark ML plutôt que d’utiliser des règles statiques. Nous verrons le process de Feature Engineering, pourquoi nous avons dû étendre Spark DataFrames pour préserver des méta-données au travers du pipeline de traitement, et comment nous avons utilisé le Machine Learning pour réconcilier les enregistrements. Nous verrons enfin comment nous avons industrialisé cette application.
Alexis Seigneurin : Développeur depuis 15 ans, j'attache beaucoup d'importance aux problématiques de traitement, d'analyse et de stockage de la donnée.Chez Ippon, j'interviens principalement sur des missions de conseil et d'architecture autour de technologies big data. Par ailleurs, j'anime la formation Spark chez Ippon.
How to Build a Recommendation Engine on SparkCaserta
How to Build a Recommendation Engine on Spark was a presentation given by Joe Caserta, CEO and founder of Caserta Concepts, at @AnalyticsWeek in Boston.
Boston's Data AnalyticsStreet Conference is a 2 day packed event with thought provoking keynotes, knowledge filled sessions, intense workshops, insightful panels, and real-world case studies - engaging analytics community with latest methodologies and trends. The conference encompasses largest Speaker-to-Attendee ratio for unmatched networking and learning opportunity.
For more information on the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/.
Matthieu Blanc présentera spark.ml. En effet, la version 1.2 de Spark a introduit ce nouveau package qui fournit une API de haut niveau permettant la création de pipeline de machine learning. Nous verrons ensemble les concepts de base de cet API à travers un exemple.
http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
Title
Real-time, Advanced Analytics and Recommendations using Machine Learning, Graph Processing, Natural Language Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
BONUS: Netflix Recommendations: Then and Now
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
BONUS: Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
Bio
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Related Links
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
Presented at the MLConf in Seattle, this presentation offers a quick introduction to Apache Spark, followed by an overview of two novel features for data science
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015Modern Data Stack France
Record Linkage, un cas d’utilisation en Spark ML par Alexis Seigneurin
Le Record Linkage est le process qui consiste à trouver, dans un data set, les enregistrements qui représentent la même entité. Cette opération est particulièrement compliquée quand, comme nous, vous travaillez avec des données anonymisées. C’est là que le Machine Learning vient en renfort ! Nous avons implémenté un algorithme de Record Linkage en Spark SQL (DataFrames) et Spark ML plutôt que d’utiliser des règles statiques. Nous verrons le process de Feature Engineering, pourquoi nous avons dû étendre Spark DataFrames pour préserver des méta-données au travers du pipeline de traitement, et comment nous avons utilisé le Machine Learning pour réconcilier les enregistrements. Nous verrons enfin comment nous avons industrialisé cette application.
Alexis Seigneurin : Développeur depuis 15 ans, j'attache beaucoup d'importance aux problématiques de traitement, d'analyse et de stockage de la donnée.Chez Ippon, j'interviens principalement sur des missions de conseil et d'architecture autour de technologies big data. Par ailleurs, j'anime la formation Spark chez Ippon.
How to Build a Recommendation Engine on SparkCaserta
How to Build a Recommendation Engine on Spark was a presentation given by Joe Caserta, CEO and founder of Caserta Concepts, at @AnalyticsWeek in Boston.
Boston's Data AnalyticsStreet Conference is a 2 day packed event with thought provoking keynotes, knowledge filled sessions, intense workshops, insightful panels, and real-world case studies - engaging analytics community with latest methodologies and trends. The conference encompasses largest Speaker-to-Attendee ratio for unmatched networking and learning opportunity.
For more information on the services and solutions Caserta Concepts offers, visit our website at http://casertaconcepts.com/.
Matthieu Blanc présentera spark.ml. En effet, la version 1.2 de Spark a introduit ce nouveau package qui fournit une API de haut niveau permettant la création de pipeline de machine learning. Nous verrons ensemble les concepts de base de cet API à travers un exemple.
http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
Title
Real-time, Advanced Analytics and Recommendations using Machine Learning, Graph Processing, Natural Language Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
BONUS: Netflix Recommendations: Then and Now
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
BONUS: Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
Bio
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Related Links
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
3. Machine Learning
• A branch of AI; A scientific discipline concerned with the design
and development of algorithms that take as input empirical
data, and yield patterns or predictions thought to be features of
the underlying mechanism that generated the data
9. Automatic Machine Translation
Basics
• What is Machine Translation ( MT )?
• What is Machine Translation Evaluation?
More art than science:
数学+语言学+计算机科学
信、达、雅? 结果质量想多了! 多大程度上,某项任务
系统开发者:体现系统问题,修复完善翻译系统
用户:系统可用性
13. State-of-the-art
• BLEU系统 (Bilingual Evaluation Understudy)
• 准确率(precision)与召回率(recall)
• N-gram match
• METEOR系统
• 在BLEU的基础上增加了词根还原和同义词检测
• 评价标准
System A: Israeli officials responsibility of airport safety
Reference: Israeli officials are responsible for airport security
System B: airport security Israeli officials are responsible
BLEU: A is better
METEOR: B is better
14. ML MTE EXAMPLE
(EnglishJapanese 2010)
• Classification yet regression model-like, i.e. continuous score
Literal translation unnaturalness in MT
Reveal unnatural translations by some classification features
Classification accuracy + comparison with manual evaluation result
16. EXAMPLE:
• solution:
• Classification feature: aligned word pairs feature
• SVM
• Scores? The distance to the hyper-plane
• Experiments and test:
流程:
比较各个模型的性能,相关系数
17. MT Evaluation Key Techniques
(so far)
• including but not limited to:
SVM
I. popular/successful
II. various problem:
a) linearly separable : maximal margin hyperlane
b) linearly inseparable: soft margin, noise
c) nonlinear: 属性变换,或者 kernel
III. robust
IV. learning problem solved using
18. Goal & Motivation
• 一、实现上述论文中的基于分类器评价系统算法
• 二、实现其他论文中基于回归的评价系统
• 三、混合两种评价系统
Motivation:
• 因为本身对语言比较感兴趣,对机器翻译问题好奇。
• 机器学习是以后学习研究的主要工具,这次毕设相当于一次入门和实
践
• Challenge: 评测则是 more art than science , 不好定量。需要多了
解别人的思想
Literal translation often makes MTs unnatural. Therefore, a critical issue in MT development is how to deal with literal translation. If a classifier uses classification features that reveal unnatural literal translations, classification results will reveal problematic MTs. Reducing unnatural literal translations would improve MTs
采用SVM学习得到分类器。以测试实例到超平面的距离为该测试实例得评价分数。
–SVM is among the most popular/successful classifiers – It provides a principled way to learn a robust classifier (i.e., a decision boundary) – chooses the one with maximum margin principle – has sound theoretical guarantee – extends to nonlinear decision boundary by using kernel trick – learning problem efficiently solved using convex optimization techniques