Analytics for the fastest growing companies
Personalised emails with Python and Spark
Tomáš Sirný, 12. 3. 2016
@junckritter
Agenda
● MapReduce
● Introduction to Apache Spark
● Spark + Python
● Use-case: Personalisation of email newsletters
About me
● Python developer
● Web development in Django
● Movie search with Elasticsearch
● Data Science
Problem of Big Data
● data are “BIG” and spread across many places
● hard and costly to bring them all together at once
● slow to process
MapReduce
● MapReduce is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a cluster
● do something with every small chunk
● choose only wanted ones
● put them together
● collect & save result
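The same phases in plain Python, as a toy word count - a minimal sketch of the paradigm itself, not of any framework API:

from collections import defaultdict

lines = ["spark is fast", "python is fun", "spark is general"]

# map: do something with every small chunk - emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: put pairs with the same key together
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# reduce: collect & save the result per key
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'spark': 2, 'is': 3, 'fast': 1, 'python': 1, 'fun': 1, 'general': 1}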
Hadoop
● Open-source implementation of Google’s MapReduce and GFS papers
● Uses MapReduce
● Version 2 introduced YARN (Yet Another Resource Negotiator), a framework for managing cluster resources
● Inputs and results of each phase are saved to files
● Complex job configuration; hard to connect from non-JVM languages
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data
processing
● fast - runs MapReduce-style workloads faster than Hadoop by keeping data in memory
● general engine - multipurpose (data transformation, machine learning, …)
● large-scale - runs in parallel on large clusters
● processing - filter, transform, save or send
RDD - Resilient Distributed Dataset
● basic data structure in Spark
● immutable distributed collection of objects
● divided into logical partitions
● computed on different nodes of the cluster
● roughly the equivalent of a table in an SQL database
Python shell included
(spark)tomas@Fenchurch:~/personal$ spark-1.5.2-bin-hadoop2.6/bin/pyspark
Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/03/11 22:16:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/11 22:16:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/03/11 22:16:16 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/
Using Python version 2.7.10 (default, Oct 23 2015 18:05:06)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc
<pyspark.context.SparkContext object at 0x1071bb810>
>>> x = sc.parallelize([1, 2, 3, 4, 5])
>>> x.count()
5
>>> x.sortBy(lambda item: item, ascending=False).collect()
[5, 4, 3, 2, 1]
Run from Python in a virtualenv

import os
import sys

# config holds the paths to the Spark installation and the virtualenv Python
os.environ['SPARK_HOME'] = config['spark_home']
os.environ['PYTHONPATH'] = os.path.join(config['spark_home'], 'python')
os.environ['PYSPARK_PYTHON'] = config['python']

sys.path.insert(
    0,
    os.path.join(config['spark_home'], 'python')
)  # pyspark
sys.path.insert(
    0,
    os.path.join(config['spark_home'], 'python/lib/py4j-0.8.2.1-src.zip')
)  # py4j
IPython/Jupyter notebook
Define SparkContext
from pyspark import SparkContext

sc = SparkContext(
    appName='GenerateEmails',
    master='yarn-client',
    pyFiles=['emails.zip']  # ship the project code to the executors
)
zip -r emails.zip * -x "*.pyc" -x "*.log"
Simple example
def get_delivered_rdd(data, date_str, mailing=None):
    delivered_rdd_raw = data \
        .filter(lambda x: x['type'] == 'campaign') \
        .filter(lambda x: x['properties']['status'] == 'delivered') \
        .filter(clicked_actual_date(date_str))
    if mailing:
        delivered_rdd_raw = delivered_rdd_raw \
            .filter(lambda x: mailing in x['properties']['mailing'])
    return delivered_rdd_raw
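clicked_actual_date is not shown in the talk; a hypothetical sketch of such a predicate factory, assuming each event carries a Unix timestamp:

from datetime import datetime

def clicked_actual_date(date_str):
    # Hypothetical helper: keep only events that happened on the given day
    def predicate(event):
        day = datetime.utcfromtimestamp(event['timestamp']).strftime('%Y-%m-%d')
        return day == date_str
    return predicate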
Complex example
def get_updated_customers(
        sc, base_path, project_id, dates='*', user_filter=None,
        customers=None):
    if customers is None:
        customers = sc.emptyRDD()
    src = os.path.join(base_path, project_id, "update_customer", dates)
    updates_raw = sc.textFile(src)
    # json_to_item yields (customer_id, event) pairs; drop empty records,
    # order by timestamp, then group, filter and join per customer
    updates = updates_raw \
        .flatMap(json_to_item) \
        .filter(lambda x: x is not None) \
        .sortBy(lambda x: x[1]['timestamp']) \
        .groupByKey() \
        .map(filter_manual_customers(user_filter)) \
        .filter(lambda x: x is not None) \
        .leftOuterJoin(customers) \
        .map(join_values)
    return updates.collectAsMap()
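json_to_item, filter_manual_customers and join_values are project helpers that were not shown. A hypothetical json_to_item, assuming each text line holds one event like the one on the Event slide below, could be:

import json

def json_to_item(line):
    # Hypothetical helper: parse one JSON line into a (customer_id, event)
    # pair; returning a list lets flatMap drop malformed lines silently
    try:
        event = json.loads(line)['data']
        return [(event['customer_id'], event)]
    except (ValueError, KeyError):
        return []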
General rule: do as much as possible in Spark
● count()
● take(n)
● combineByKey()
● distinct()
● countByKey()
● foreach()
● groupBy()
● union(), intersection()
● join(), leftOuterJoin()
● reduceByKey()
● sortBy(), sortByKey()
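A concrete example of the rule: filter and aggregate on the cluster, and collect only the small result into the driver. events and its fields are assumptions for illustration:

# count clicks per customer on the cluster, fetch only the top ten
clicks_per_customer = events \
    .filter(lambda e: e['type'] == 'click') \
    .map(lambda e: (e['customer_id'], 1)) \
    .reduceByKey(lambda a, b: a + b)

top_ten = clicks_per_customer.takeOrdered(10, key=lambda kv: -kv[1])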
Use case:
personalised emails
Requirements:
● Set of defined sections in email - items from categories
● Set of customers subscribed to different newsletters
● Choose the best N sections for each customer, based on her activity
Infrastructure
Event
{
  "type": "add_event",
  "data": {
    "timestamp": 1444043432.0,
    "customer_id": "c1",
    "type": "campaign",
    "company_id": "ffe66a48-f341-11e4-8cbf-b083fedeed2e",
    "properties": {
      "status": "enqueued",
      "sections": {
        "s-test1": ["164377", "157663", "165109", "159075", "153851", "161695"],
        "s-test2": ["162363", "152249", "162337", "156861", "162109", "165021"],
        "s-test3": ["115249", "150349", "148291", "148265", "157581", "159479"]
      }
    }
  }
}
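Once such a line is parsed, the section → item mapping is plain Python dicts and lists (field names as in the event above):

import json

raw = '{"type": "add_event", "data": {"timestamp": 1444043432.0, "customer_id": "c1", "type": "campaign", "properties": {"status": "enqueued", "sections": {"s-test1": ["164377", "157663"]}}}}'

event = json.loads(raw)['data']
for section, item_ids in event['properties']['sections'].items():
    print(section, item_ids)  # s-test1 ['164377', '157663']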
Algorithm

Relevance score of each section (S-xx) for each customer (C-xx):

        S-01   S-02   S-03   S-04   S-05   S-06   S-07   S-08   S-09   S-10
C-01    0.5    0.555  0.23   0.11   0.734  0.93   0.34   0.66   0.85   0.15
C-02    0.4    0.955  0.13   0.76   0.833  0.53   0.74   0.84   0.585  0.45
C-42    0.67   0.555  0.73   0.11   0.234  0.93   0.34   0.66   0.85   0.15
C-43    0.8    0.555  0.33   0.51   0.79   0.43   0.14   0.46   0.55   0.85
C-103   0.9    0.335  0.27   0.11   0.734  0.93   0.34   0.86   0.65   0.15
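Given such a customer × section score matrix, picking the best N sections per customer is straightforward; a sketch with heapq (the talk's actual optimisation algorithm is not shown):

import heapq

# scores: customer -> {section: relevance}, a couple of rows from the matrix
scores = {
    'C-01': {'S-01': 0.5, 'S-02': 0.555, 'S-05': 0.734, 'S-06': 0.93, 'S-09': 0.85},
    'C-02': {'S-02': 0.955, 'S-04': 0.76, 'S-05': 0.833, 'S-08': 0.84},
}

N = 3
best = {customer: heapq.nlargest(N, sections, key=sections.get)
        for customer, sections in scores.items()}
print(best)  # {'C-01': ['S-06', 'S-09', 'S-05'], 'C-02': ['S-02', 'S-08', 'S-05']}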
Spark job
● Load data from JSON files on MapR-FS (a distributed file system)
● Create a profile of each customer - email address, subscriptions, history of actions (clicks, views, purchases, …)
● Filter by attributes (subscription, preferences)
● Create customer-specific parameters for the main algorithm (which sections she clicked, which emails she opened)
Python program
● Run Spark job
● Read customer records
● Feed them to main optimisation algorithm
● Generate & send personalised emails
Take-aways
● It’s easy to get started with Spark through Python
● A few lines of code can process massive data
○ MapReduce, SQL, graphs, machine learning
● Spark is quickly becoming the standard for data science,
and data science is the future of web applications
Thanks!