Real-Time User Attribute Estimation with Spark Streaming
@laclefyoshi
<ysaeki@r.recruit.co.jp>
• Spark Streaming
• Spark Streaming Tips
Speaker: SAEKI Yoshiyasu
R&D at an IT company: Hadoop, Kafka, Storm, Spark, Druid
RICOH Theta + Google Cardboard
Spark Streaming
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
http://www.recruit.jp/company/about/structure.html
1. Web front end (JavaScript)
2. fluentd → Kafka
fluentd → Kafka
• fluent-plugin-kafka (config sketch below)
• https://github.com/htgc/fluent-plugin-kafka
• output type = kafka_buffered (file-backed buffer)
• Kafka 0.8.2.2
• Kafka 0.9.0 adds ACL support
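A minimal config sketch for this setup, assuming fluent-plugin-kafka of that era; the tag, broker hosts, topic, and buffer path are placeholders:

<match weblog.**>
  type kafka_buffered            # buffered Kafka output from fluent-plugin-kafka
  brokers kafka01:9092,kafka02:9092
  default_topic web_logs
  output_data_type json
  buffer_type file               # file-backed buffer survives agent restarts
  buffer_path /var/log/td-agent/buffer/kafka
</match>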
Suro
• From Netflix
• https://github.com/Netflix/suro
• Input: Kafka Consumer API / Thrift API
• Output:
• HDFS
• AWS S3
• Kafka Producer
• Elasticsearch
Gobblin (LinkedIn)
• Data ingestion into Hadoop
• Output: HDFS
MLlib streaming algorithms (sketch below)
• Streaming linear regression (classification)
• Streaming k-means (clustering)
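A minimal sketch of the streaming linear regression API (Spark 1.5 era MLlib); the socket source, line format, and feature count are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("attr"), Seconds(1))

// "label,f1 f2 f3" lines; any DStream[LabeledPoint] source works
val training = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(label, features) = line.split(",")
  LabeledPoint(label.toDouble, Vectors.dense(features.split(" ").map(_.toDouble)))
}

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))   // 3 features assumed

model.trainOn(training)                  // model updates with every micro-batch
model.predictOn(training.map(_.features)).print()

ssc.start()
ssc.awaitTermination()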
Spark Streaming
Kafka
• Direct Approach (Spark >= 1.3)
• Exactly-once semantics
• Built on the Kafka Simple Consumer API
• We use the Direct Approach (sketch below)
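A minimal sketch of creating a direct stream with spark-streaming-kafka (Spark 1.3+ API), given a StreamingContext ssc; broker list and topic are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "kafka01:9092,kafka02:9092")
val topics = Set("web_logs")

// No receiver: each micro-batch reads its own offset range directly from
// the brokers (Simple Consumer API), which enables exactly-once input
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)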
Spark Streaming (1)
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
[Figure: a DStream is a sequence of RDDs: RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4]
Spark Streaming (2)
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
Micro-batch
• Events are aggregated per user (Cookie) in each micro-batch
• Window-based micro-batch: one window spans multiple micro-batches (sketch below)
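For instance, window() turns a plain DStream into a window-based one; given some dstream: DStream[String], the sizes below are illustrative and must be multiples of the batch interval:

import org.apache.spark.streaming.Seconds

// 60 s of events, recomputed on every 1 s micro-batch (values assumed)
val windowed = dstream.window(Seconds(60), Seconds(1))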
Micro-batch
• Writing each RDD to HBase from foreachRDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text
import org.apache.spark.rdd.PairRDDFunctions

dstream.foreachRDD { rdd =>
  // createHbaseConfiguration() (defined elsewhere) is assumed to set the
  // ZooKeeper quorum and the output table for TableOutputFormat
  val hbaseConf = createHbaseConfiguration()
  val jobConf = new Configuration(hbaseConf)
  jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.outputformat.class",
    classOf[TableOutputFormat[Text]].getName)
  new PairRDDFunctions(rdd.map(hbaseConvert)).saveAsNewAPIHadoopDataset(jobConf)
}

// Convert RDD[(String, Map[K, V])] elements into RDD[(String, Put)] elements
def hbaseConvert(t: (String, Map[String, String])): (String, Put) = {
  val p = new Put(Bytes.toBytes(t._1))
  t._2.toSeq.foreach( m =>
    p.addColumn(Bytes.toBytes("seg"),
      Bytes.toBytes(m._1), Bytes.toBytes(m._2))
  )
  (t._1, p)
}
Spark Streaming: programming model
• A DStream is a sequence of RDDs
• Existing Spark (batch) code carries over to Spark Streaming, as sketched below the link
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
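As a sketch of that reuse (enrich is a made-up name, and dstream: DStream[String] is assumed): a function written against RDDs runs unchanged on a DStream via transform:

import org.apache.spark.rdd.RDD

// Any RDD => RDD function, e.g. taken from an existing batch job
def enrich(rdd: RDD[String]): RDD[String] =
  rdd.filter(_.nonEmpty).map(_.toUpperCase)

val enriched = dstream.transform(enrich _)   // applied to every micro-batch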
Spark Streaming: operations
• Fault tolerance: a failed micro-batch can be recomputed
• Runs on YARN
• YARN Dynamic Resource Allocation
Spark Streaming: testing
• A transformation is a function: input → output
• RDD → RDD, or DStream → DStream
• Test one micro-batch at a time
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// RDD → RDD: build a fixture RDD directly
val input: RDD[String] = sparkContext.makeRDD(Seq("a", "b", "c"))
// DStream → DStream: feed fixture RDDs through a queue
val queue = scala.collection.mutable.Queue(input)
val dstream: DStream[String] =
  sparkStreamingContext.queueStream(queue)
Spark Streaming: spark-testing-base
• spark-testing-base
• https://github.com/holdenk/spark-testing-base

class JsonElementCountTest extends StreamingSuiteBase {
  test("simple") {
    val input = List(List("aa"), List("bb"))
    val expected = List(List("AA"), List("BB"))
    testOperation[String, String](
      input, converterMethod _, expected, useSet = true)
  }

  // Assumed implementation of the operation under test,
  // implied by the expected output above
  def converterMethod(ds: DStream[String]): DStream[String] =
    ds.map(_.toUpperCase)
}
Spark Streaming: testing window operations
• Window-based micro-batches depend on time
• o.a.spark.streaming.util.ManualClock lets a test advance the clock manually
• It is a private class, but Scala offers a workaround (sketch below)
• http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
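A sketch of the workaround from the linked post: a thin wrapper declared inside Spark's own package can reach the scheduler's ManualClock. Names follow that post; the clock's package and access path vary between Spark versions.

package org.apache.spark.streaming

import org.apache.spark.util.ManualClock

// Being in this package grants access to private[streaming] members; the test
// must also set conf "spark.streaming.clock" to
// "org.apache.spark.util.ManualClock" before creating the StreamingContext
class ClockWrapper(ssc: StreamingContext) {
  private def clock: ManualClock =
    ssc.scheduler.clock.asInstanceOf[ManualClock]

  def getTimeMillis: Long = clock.getTimeMillis()

  // Advancing the clock fires any micro-batches that fall due
  def advance(millis: Long): Unit = clock.advance(millis)
}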
Spark Streaming: Scala or Java
• Both Scala and Java APIs are available
• Our Spark Streaming + Kafka + HBase code is written in Scala
• Java interop is provided by implicit conversions:
// api/java/JavaRDD.scala
object JavaRDD {
  implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] =
    new JavaRDD[T](rdd)

  implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
}
• Spark Streaming
• MLlib
• GraphX