SlideShare a Scribd company logo
Clickstream & Social Media Analysis 
Use cases and examples using Apache Spark 
Michael Cutler @ TUMRA – November 2014
Hello 
About Me 
• Early adopter of Hadoop 
• Spoke at Hadoop World on 
machine learning 
• Twitter: @cotdp 
TUMRA 
We use Data Science and Big Data 
technology to help ecommerce 
companies understand their 
customers and increase sales. 
This Talk 
• Slide are on Slideshare 
• Code example on Github 
• Twitter: @tumra
1 Background 
2 Introducing Apache Spark 
3 Examples
1 Background
Clickstream & Social Media Analysis 
Generalised Approach 
Mobile/Tablet App 
Data 
Collection 
Data 
Processing 
Reporting & 
Analysis 
Web Site 
You 
People 
Social Network 
Events Files Tables
How has this approach evolved? 
Rapidly reducing the ‘time to insight’ 
pre-Historic Hadoop 
• Proprietary & Expensive 
• Slow Constrained 
Time to Insight 
48+ hours 
2008 - Hadoop 
• Open-source & Inexpensive 
• Flexible but complex to use 
Time to Insight 
hours 
2014 - Spark 
• Batch, Streaming & Interactive 
• Fast & Easy to use 
Time to Insight 
minutes
Weaving a story from a string of activities 
Understanding the shoppers journey 
PPC long-tail 
keyword 
Day #0 
Opened Email 
Newsletter on iPad PPC brand 
PPC brand keyword & 
signed up email 
keyword 
Add To Cart 
Order 
Placed 
Day #7 Day #10 Day #13 Day #17
Shopper Journey 
Understanding the shoppers journey 
Time 
Shopper 
Consumer 
Research Consideration Purchase 
Need
It’s all about People & Products 
Not just boring log files! 
Turn low-level events like “Page Views” into something meaningful 
e.g. <Person1234> <viewed-a> <Product:Camera> 
Bought a … 
Activity & Interactions 
Gauging Interest 
Measuring the degree of interest a Person has about a Product 
e.g. are 10 views for a certain Product a good or bad thing? 
Affinities 
Either inferred from other Peoples activities, or Product similarity 
Properties 
Both people and products have properties, 
e.g. <Person1234> <is:gender> <Female>
People & Product Interactions 
Source: Snowplow Analytics 
e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”
That sounds like a Graph … 
Use graphs to understand user intent 
Interest Graph Visualisation 
• Collect user activity data in real-time, not just 
clicks but mouse-overs, images, video, social. 
• Algorithms identify products, categories and 
brands a particular person is interested in. 
• Cluster users into ‘neighborhoods’ to infer what to 
show to existing and future visitors. 
This visualization illustrates just 1% of 6 weeks visitor 
activity data. Blue data points are People, Orange 
data points are Products.
Introducing 2 Apache Spark
Three reasons Apache Spark is awesome! 
Apart from “no more Java Map/Reduce code!!!” 
Fast 
• In-memory Caching 
• DAG execution optimisation 
• Easy to use in Scala, Java, Python 
Smart 
• Machine Learning baked in 
• Graph algorithms 
• Interactive Shell 
Flexible 
• Query from Spark SQL 
• Streaming 
• Batch (file based)
Apache Spark 
Architecture Overview 
Apache ZooKeeper Hadoop Filesystem 
(HDFS) 
Yarn / Mesos 
(optional)
Apache Spark 
Coexists with your existing Hadoop Infrastructure 
Hadoop Filesystem (HDFS) 
Apache ZooKeeper 
Apache Hive etc. 
Map / Reduce 
Yarn / Mesos
Apache Spark can … 
Simple example of Spark SQL used from Scala 
Source: Databricks 
Go from a SQL query… 
… to a trained machine learning 
model in three lines of code.
3 Examples
Example Architecture 
Coexists with your existing Hadoop Infrastructure 
Reporting 
Dashboard 
Hadoop Filesystem (HDFS) NoSQL Store 
Apache ZooKeeper 
(Cassandra) 
Apache Kafka 
Analytics 
Jobs
Social Media Analysis 
Converting a low-level event into a meaningful high-level interaction 
• A user-interaction from the 
Facebook firehose, received as a 
real-time stream of JSON 
• Streamed into Apache Kafka, 
also stored in SequenceFiles 
• Modeled into Scala Case Class:
Example - Spark (Scala) 
Using the Spark (Scala) interface to analyze the data 
• Parse JSON 
• Extract interesting attributes 
• ‘Reduce by Key’ to sum the result 
• Print results
Example - Spark SQL 
Using the Spark SQL interface to analyze the data 
• Parse JSON 
• Extract interesting attributes, 
transform into Case Classes 
• ‘Register as table’ 
• Execute SQL, print results
Want to play with awesome tech and data? 
We’re hiring! team@tumra.com 
Data Engineer 
Scala, functional programming, 
Hadoop, NoSQL 
Sales & Marketing 
Experience with SaaS and ecommerce sales

More Related Content

What's hot

Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Web Services
 
Server Monitoring from the Cloud
Server Monitoring from the CloudServer Monitoring from the Cloud
Server Monitoring from the Cloud
Site24x7
 
Amazon EFS
Amazon EFSAmazon EFS
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
confluent
 
AWS Kinesis Streams
AWS Kinesis StreamsAWS Kinesis Streams
AWS Kinesis Streams
Fernando Rodriguez
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
Amazon Web Services
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with Pacemaker
Kris Buytaert
 
SIEM (Security Information and Event Management)
SIEM (Security Information and Event Management)SIEM (Security Information and Event Management)
SIEM (Security Information and Event Management)
Osama Ellahi
 
SIEM - Activating Defense through Response by Ankur Vats
SIEM - Activating Defense through Response by Ankur VatsSIEM - Activating Defense through Response by Ankur Vats
SIEM - Activating Defense through Response by Ankur Vats
OWASP Delhi
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
Avinash Ramineni
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Databricks
 
Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...
Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...
Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...
Amazon Web Services
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Kai Wähner
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
Amazon Web Services
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Amazon Web Services
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
Amazon Web Services
 
KSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKSnow: Getting started with Snowflake
KSnow: Getting started with Snowflake
Knoldus Inc.
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
Databricks
 

What's hot (20)

Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
Server Monitoring from the Cloud
Server Monitoring from the CloudServer Monitoring from the Cloud
Server Monitoring from the Cloud
 
Amazon EFS
Amazon EFSAmazon EFS
Amazon EFS
 
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
Data in Motion을 위한 이벤트 기반 마이크로서비스 아키텍처 소개
 
AWS Kinesis Streams
AWS Kinesis StreamsAWS Kinesis Streams
AWS Kinesis Streams
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with Pacemaker
 
SIEM (Security Information and Event Management)
SIEM (Security Information and Event Management)SIEM (Security Information and Event Management)
SIEM (Security Information and Event Management)
 
SIEM - Activating Defense through Response by Ankur Vats
SIEM - Activating Defense through Response by Ankur VatsSIEM - Activating Defense through Response by Ankur Vats
SIEM - Activating Defense through Response by Ankur Vats
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
 
Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...
Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...
Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS reInvent ...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
KSnow: Getting started with Snowflake
KSnow: Getting started with SnowflakeKSnow: Getting started with Snowflake
KSnow: Getting started with Snowflake
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 

Similar to Clickstream & Social Media Analysis using Apache Spark

Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
TUMRA | Big Data Science - Gain a competitive advantage through Big Data & Data Science
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3
Snowplow Analytics
 
Analytics With PowerBI On Azure
Analytics With PowerBI On AzureAnalytics With PowerBI On Azure
Analytics With PowerBI On Azure
Anita Luthra
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
MLconf
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
Qubole
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Lillian Pierson
 
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Lucas Jellema
 
Introduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on HadoopIntroduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on Hadoop
Jongwook Woo
 
Data Scientist Toolbox
Data Scientist ToolboxData Scientist Toolbox
Data Scientist Toolbox
Andrei Savu
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
Nishant Gandhi
 
Big data and AI in Socialbakers
Big data and AI in SocialbakersBig data and AI in Socialbakers
Big data and AI in Socialbakers
ppetr82
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Hortonworks
 
Hadoop summit socialize_v1.0
Hadoop summit socialize_v1.0Hadoop summit socialize_v1.0
Hadoop summit socialize_v1.0Isaac Mosquera
 
Splunk/Socialize at Hadoop Summit
Splunk/Socialize at Hadoop SummitSplunk/Socialize at Hadoop Summit
Splunk/Socialize at Hadoop Summit
Isaac Mosquera
 
Big Data at the Speed of Business: Lessons Learned from Leading at the Edge
Big Data at the Speed of Business: Lessons Learned from Leading at the EdgeBig Data at the Speed of Business: Lessons Learned from Leading at the Edge
Big Data at the Speed of Business: Lessons Learned from Leading at the Edge
DataWorks Summit
 
Meetup070416 Presentations
Meetup070416 PresentationsMeetup070416 Presentations
Meetup070416 Presentations
Ana Rebelo
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 
Open Blueprint for Real-Time Analytics with In-Stream Processing
Open Blueprint for Real-Time Analytics with In-Stream ProcessingOpen Blueprint for Real-Time Analytics with In-Stream Processing
Open Blueprint for Real-Time Analytics with In-Stream Processing
Grid Dynamics
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
CrowdFlower
 

Similar to Clickstream & Social Media Analysis using Apache Spark (20)

Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3
 
Analytics With PowerBI On Azure
Analytics With PowerBI On AzureAnalytics With PowerBI On Azure
Analytics With PowerBI On Azure
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
 
Introduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on HadoopIntroduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on Hadoop
 
Data Scientist Toolbox
Data Scientist ToolboxData Scientist Toolbox
Data Scientist Toolbox
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Big data and AI in Socialbakers
Big data and AI in SocialbakersBig data and AI in Socialbakers
Big data and AI in Socialbakers
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
 
Hadoop summit socialize_v1.0
Hadoop summit socialize_v1.0Hadoop summit socialize_v1.0
Hadoop summit socialize_v1.0
 
Splunk/Socialize at Hadoop Summit
Splunk/Socialize at Hadoop SummitSplunk/Socialize at Hadoop Summit
Splunk/Socialize at Hadoop Summit
 
Big Data at the Speed of Business: Lessons Learned from Leading at the Edge
Big Data at the Speed of Business: Lessons Learned from Leading at the EdgeBig Data at the Speed of Business: Lessons Learned from Leading at the Edge
Big Data at the Speed of Business: Lessons Learned from Leading at the Edge
 
Meetup070416 Presentations
Meetup070416 PresentationsMeetup070416 Presentations
Meetup070416 Presentations
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Open Blueprint for Real-Time Analytics with In-Stream Processing
Open Blueprint for Real-Time Analytics with In-Stream ProcessingOpen Blueprint for Real-Time Analytics with In-Stream Processing
Open Blueprint for Real-Time Analytics with In-Stream Processing
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
 

Recently uploaded

一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 

Recently uploaded (20)

一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 

Clickstream & Social Media Analysis using Apache Spark

  • 1. Clickstream & Social Media Analysis Use cases and examples using Apache Spark Michael Cutler @ TUMRA – November 2014
  • 2. Hello About Me • Early adopter of Hadoop • Spoke at Hadoop World on machine learning • Twitter: @cotdp TUMRA We use Data Science and Big Data technology to help ecommerce companies understand their customers and increase sales. This Talk • Slide are on Slideshare • Code example on Github • Twitter: @tumra
  • 3. 1 Background 2 Introducing Apache Spark 3 Examples
  • 5. Clickstream & Social Media Analysis Generalised Approach Mobile/Tablet App Data Collection Data Processing Reporting & Analysis Web Site You People Social Network Events Files Tables
  • 6. How has this approach evolved? Rapidly reducing the ‘time to insight’ pre-Historic Hadoop • Proprietary & Expensive • Slow Constrained Time to Insight 48+ hours 2008 - Hadoop • Open-source & Inexpensive • Flexible but complex to use Time to Insight hours 2014 - Spark • Batch, Streaming & Interactive • Fast & Easy to use Time to Insight minutes
  • 7. Weaving a story from a string of activities Understanding the shoppers journey PPC long-tail keyword Day #0 Opened Email Newsletter on iPad PPC brand PPC brand keyword & signed up email keyword Add To Cart Order Placed Day #7 Day #10 Day #13 Day #17
  • 8. Shopper Journey Understanding the shoppers journey Time Shopper Consumer Research Consideration Purchase Need
  • 9. It’s all about People & Products Not just boring log files! Turn low-level events like “Page Views” into something meaningful e.g. <Person1234> <viewed-a> <Product:Camera> Bought a … Activity & Interactions Gauging Interest Measuring the degree of interest a Person has about a Product e.g. are 10 views for a certain Product a good or bad thing? Affinities Either inferred from other Peoples activities, or Product similarity Properties Both people and products have properties, e.g. <Person1234> <is:gender> <Female>
  • 10. People & Product Interactions Source: Snowplow Analytics e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”
  • 11. That sounds like a Graph … Use graphs to understand user intent Interest Graph Visualisation • Collect user activity data in real-time, not just clicks but mouse-overs, images, video, social. • Algorithms identify products, categories and brands a particular person is interested in. • Cluster users into ‘neighborhoods’ to infer what to show to existing and future visitors. This visualization illustrates just 1% of 6 weeks visitor activity data. Blue data points are People, Orange data points are Products.
  • 13. Three reasons Apache Spark is awesome! Apart from “no more Java Map/Reduce code!!!” Fast • In-memory Caching • DAG execution optimisation • Easy to use in Scala, Java, Python Smart • Machine Learning baked in • Graph algorithms • Interactive Shell Flexible • Query from Spark SQL • Streaming • Batch (file based)
  • 14. Apache Spark Architecture Overview Apache ZooKeeper Hadoop Filesystem (HDFS) Yarn / Mesos (optional)
  • 15. Apache Spark Coexists with your existing Hadoop Infrastructure Hadoop Filesystem (HDFS) Apache ZooKeeper Apache Hive etc. Map / Reduce Yarn / Mesos
  • 16. Apache Spark can … Simple example of Spark SQL used from Scala Source: Databricks Go from a SQL query… … to a trained machine learning model in three lines of code.
  • 18. Example Architecture Coexists with your existing Hadoop Infrastructure Reporting Dashboard Hadoop Filesystem (HDFS) NoSQL Store Apache ZooKeeper (Cassandra) Apache Kafka Analytics Jobs
  • 19. Social Media Analysis Converting a low-level event into a meaningful high-level interaction • A user-interaction from the Facebook firehose, received as a real-time stream of JSON • Streamed into Apache Kafka, also stored in SequenceFiles • Modeled into Scala Case Class:
  • 20. Example - Spark (Scala) Using the Spark (Scala) interface to analyze the data • Parse JSON • Extract interesting attributes • ‘Reduce by Key’ to sum the result • Print results
  • 21. Example - Spark SQL Using the Spark SQL interface to analyze the data • Parse JSON • Extract interesting attributes, transform into Case Classes • ‘Register as table’ • Execute SQL, print results
  • 22. Want to play with awesome tech and data? We’re hiring! team@tumra.com Data Engineer Scala, functional programming, Hadoop, NoSQL Sales & Marketing Experience with SaaS and ecommerce sales