SECON'2017, Konstantin Makarychev, Using Spark for Machine Learning

This document discusses machine learning in Apache Spark. It describes how Spark can be used for large-scale machine learning tasks through libraries like MLlib. It provides an example machine learning pipeline that preprocesses text data using tokenization and hashing, trains a logistic regression model, and saves the model for later use. The document also discusses serving machine learning models and different approaches for deploying Spark and machine learning applications in production.

1. MACHINE LEARNING IN SPARK. Konstantin Makarychev, SECON 2017

2. Big Data: Volume, Velocity, Variety

3. Apache Spark http://spark.apache.org/

4.
    val wordCounts = textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)

    [Diagram: five executors]
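
The word count on this slide can be tried without a cluster. The following is a minimal sketch in plain Scala collections, not Spark code: `LocalWordCount` and `wordCounts` are hypothetical names, and `groupBy` followed by a per-key `reduce` stands in for what `reduceByKey` does across executors.

```scala
// Single-machine analogue of the Spark word count (plain Scala, no Spark dependency).
object LocalWordCount {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // split every line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy(_._1)                      // gather the pairs that share a word
      .map { case (word, pairs) =>
        // sum the counts per word, which is the job reduceByKey does per key
        (word, pairs.map(_._2).reduce((a, b) => a + b))
      }

  def main(args: Array[String]): Unit =
    println(wordCounts(Seq("apache spark", "spark machine learning")))
}
```

In Spark the same three steps run on partitions of the data held by different executors, and only the per-key partial sums are exchanged between them.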

5. SQL, Streaming, GraphX, MLlib

6. Machine Learning: training + serving

7. Pipeline: preprocess → preprocess → train → model

8. Tokenizer

    text                      label
    apache spark              1
    hadoop mapreduce          0
    spark machine learning    1

    words                         label
    [apache, spark]               1
    [hadoop, mapreduce]           0
    [spark, machine, learning]    1
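
What the tokenizer stage does to each row can be sketched in a few lines of plain Scala. This is an illustration rather than Spark's implementation; `TokenizerSketch` and `tokenize` are hypothetical names. It mirrors the documented behaviour of Spark ML's `Tokenizer`, which lowercases the text and splits it on whitespace.

```scala
// Plain-Scala sketch of the tokenizer step (no Spark dependency).
object TokenizerSketch {
  // Lowercase the input, then split it on whitespace characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s").toSeq

  def main(args: Array[String]): Unit =
    Seq("apache spark", "hadoop mapreduce", "spark machine learning")
      .foreach(text => println(tokenize(text).mkString("[", ", ", "]")))
}
```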

9. Hashing TF

    words                         label
    [apache, spark]               1
    [hadoop, mapreduce]           0
    [spark, machine, learning]    1

    features (indices, values)         label
    [105, 495], [1.0, 1.0]             1
    [6, 638, 655], [1.0, 1.0, 1.0]     0
    [105, 72, 852], [1.0, 1.0, 1.0]    1
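
The hashing step maps each word to a bucket index and counts how often each bucket is hit, which yields the sparse (indices, values) vectors on this slide. Below is a minimal sketch of that idea in plain Scala. `HashingTFSketch`, `bucket` and `termFrequencies` are hypothetical names, and the sketch uses `scala.util.hashing.MurmurHash3.stringHash` as the hash function, so its indices are not expected to match the ones Spark produced for the slide.

```scala
import scala.util.hashing.MurmurHash3

// Plain-Scala sketch of the hashing trick (no Spark dependency).
object HashingTFSketch {
  // Map a term to a bucket in [0, numFeatures).
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw // keep the index non-negative
  }

  // Sparse term-frequency vector as (bucket index, count) pairs, sorted by index.
  def termFrequencies(words: Seq[String], numFeatures: Int): Seq[(Int, Double)] =
    words
      .groupBy(word => bucket(word, numFeatures))
      .map { case (index, hits) => (index, hits.size.toDouble) }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit =
    println(termFrequencies(Seq("spark", "machine", "learning"), numFeatures = 1000))
}
```

Because the index depends only on the word and on `numFeatures`, the same word always lands in the same bucket, which is why "spark" has index 105 in both rows of the slide. Two different words can collide in one bucket; a larger `numFeatures` makes that less likely.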

10. Logistic regression

    features (indices, values)         label
    [105, 495], [1.0, 1.0]             1
    [6, 638, 655], [1.0, 1.0, 1.0]     0
    [105, 72, 852], [1.0, 1.0, 1.0]    1

    Trained model:
    0    72    -2.7138781446090308
    0    94     0.9042505436914775
    0    105    3.0835670890496645
    …
    0    495    3.2071722417080766
    0    722    0.9042505436914775
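
To see how such numbers turn into a prediction, the sketch below scores a sparse feature vector with the logistic function. It rests on several assumptions that the slide does not confirm: each model row is read as (row, feature index, weight), the intercept is taken to be 0.0, and any weight hidden behind the "…" is treated as 0.0. `LogisticScoreSketch` and its members are hypothetical names.

```scala
// Plain-Scala sketch of logistic regression scoring (no Spark dependency).
object LogisticScoreSketch {
  // Weights copied from the slide, keyed by feature index (assumed interpretation).
  val weights: Map[Int, Double] = Map(
    72  -> -2.7138781446090308,
    94  -> 0.9042505436914775,
    105 -> 3.0835670890496645,
    495 -> 3.2071722417080766,
    722 -> 0.9042505436914775
  )

  // Assumption: the slide does not show an intercept, so 0.0 is used here.
  val intercept: Double = 0.0

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of label 1 for a sparse vector of (feature index, value) pairs.
  // Indices without a known weight contribute 0.0 (assumption).
  def probability(features: Seq[(Int, Double)]): Double = {
    val z = features.map { case (index, value) => weights.getOrElse(index, 0.0) * value }.sum
    sigmoid(z + intercept)
  }

  def main(args: Array[String]): Unit =
    println(probability(Seq(105 -> 1.0, 495 -> 1.0))) // features of "apache spark"
}
```

Under these assumptions the row for "apache spark" gets a positive score from indices 105 and 495, so its probability is above 0.5, in line with its label of 1.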

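11.
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Stage 1: split the "text" column into lowercase words.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    // Stage 2: hash the words into a 1000-dimensional term-frequency vector.
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    // Stage 3: logistic regression on the "features" column.
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    // `training` is a DataFrame with "text" and "label" columns (not shown on the slide).
    val model = pipeline.fit(training)
    model.write.save("/tmp/spark-model")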
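
The slide does not show how `training` is built. One way to construct it from the three labelled rows of slide 8 is sketched below; it needs Spark SQL on the classpath. `TrainingDataSketch` and `trainingData` are hypothetical names, and using `Double` labels in a column named "label" is an assumption based on the defaults of `LogisticRegression`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: build the training DataFrame from the rows shown on slide 8.
object TrainingDataSketch {
  def trainingData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      ("apache spark", 1.0),
      ("hadoop mapreduce", 0.0),
      ("spark machine learning", 1.0)
    )).toDF("text", "label")

  def main(args: Array[String]): Unit = {
    // A local session for experimentation; a real job would get its master from spark-submit.
    val spark = SparkSession.builder()
      .appName("training-data-sketch")
      .master("local[*]")
      .getOrCreate()
    trainingData(spark).show(truncate = false)
    spark.stop()
  }
}
```

In `spark-shell`, where a `spark` session already exists, `val training = TrainingDataSketch.trainingData(spark)` gives a DataFrame that the pipeline can be fitted on.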

12. Pipeline: preprocess → preprocess → model

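13.
    import org.apache.spark.ml.PipelineModel

    // Tuple1 wraps each string so that createDataFrame receives a Seq of Product rows.
    val test = spark.createDataFrame(Seq(
      Tuple1("spark hadoop"),
      Tuple1("hadoop learning")
    )).toDF("text")
    // Load the pipeline saved during training and apply it to the new rows.
    val model = PipelineModel.load("/tmp/spark-model")
    model.transform(test).collect()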

14. ./bin/spark-submit …

15.

16.

17. [Diagram labels: data, data scientist, spark cluster, model, web app]

18. [Diagram labels: data, data scientist, spark cluster, model, web app, DB]

19. [Diagram labels: data, data scientist, spark cluster, model, web app, libs, deps, model, docker]

20. [Diagram labels: data, data scientist, spark cluster, model, web app, API]

21. [Diagram labels: data, data scientist, spark cluster, model, web app, API, serving, API]

22. Hydrosphere Mist https://github.com/hydrospheredata/mist