Continuous Analytics & Optimisation
Use cases and examples using Apache Spark
Michael Cutler @ TUMRA – January 2015
Hello
•  Early adopter of Hadoop
•  Spoke at Hadoop World on
machine learning
•  Twitter: @cotdp
About Me
We use Data Science and Big Data
technology to help ecommerce
companies understand their
customers and increase sales.
TUMRA
•  Slides are on Slideshare
•  Code examples are on Github
•  Twitter: @tumra
This Talk
1. Background
2. Introducing Apache Spark
3. Example Use Case
1. Background
Clickstream & Social Media Analysis
A generalised approach
[Diagram: events flow from a Mobile/Tablet App, Web Site and Social Network ("You" and other People) into Data Collection (Events), through Data Processing (Files), and out to Reporting & Analysis (Tables).]
Basic Architecture
Three things we want to do
•  Collect data continuously
•  Various input sources
•  Lots of “unstructured” data
Data Collection
•  Summarise the data, counts
and distributions
•  Alerting on outliers
Data Processing
•  Time-series
•  Trends over time
•  Filtering/segmenting
Reporting
How has this approach evolved?
Rapidly reducing the ‘time to insight’
•  Proprietary & Expensive
•  Slow & Constrained
Time to Insight
48+ hours
Prehistoric (pre-Hadoop)
•  Open-source & Inexpensive
•  Flexible but complex to use
Time to Insight
hours
2008 - Hadoop
•  Batch, Streaming & Interactive
•  Fast & Easy to use
Time to Insight
minutes
2014 - Spark
Weaving a story from a string of activities
Understanding the shopper's journey
•  Day #0 – PPC long-tail keyword
•  Day #7 – PPC brand keyword & signed up for email
•  Day #10 – Opened Email Newsletter on iPad
•  Day #13 – PPC brand keyword
•  Day #17 – Add To Cart → Order Placed
It’s all about People & Products
Not just boring log files!
Turn low-level events like “Page Views” into something meaningful
e.g. <Person1234> <viewed-a> <Product:Camera>
Bought a …
Activity & Interactions
Measuring the degree of interest a Person has about a Product
e.g. are 10 views for a certain Product a good or bad thing?
Gauging Interest
Either inferred from other People's activities, or from Product similarity
Affinities
Both people and products have properties,
e.g. <Person1234> <is:gender> <Female>
Properties
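The event-to-interaction model above can be sketched as a small data structure. This is a minimal illustration in Python (the class and field names are mine, not from the deck); interactions and properties share the same subject-verb-object shape:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interaction:
    """A low-level event lifted into a meaningful triple."""
    person: str   # e.g. "Person1234"
    verb: str     # e.g. "viewed-a", "bought-a", "is:gender"
    obj: str      # e.g. "Product:Camera"

# Raw page-view event -> high-level interaction
raw_event = {"user_id": "Person1234", "url": "/products/camera-x100"}
interaction = Interaction(raw_event["user_id"], "viewed-a", "Product:Camera")

# Properties of People (or Products) attach the same way
gender = Interaction("Person1234", "is:gender", "Female")
```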
People & Product Interactions
e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”
Source: Snowplow Analytics
That sounds like a Graph …
Use graphs to understand user intent
Interest Graph Visualisation
•  Collect user activity data in real-time, not just
clicks but mouse-overs, images, video, social.
•  Algorithms identify products, categories and
brands a particular person is interested in.
•  Cluster users into ‘neighborhoods’ to infer what to
show to existing and future visitors.
This visualization illustrates just 1% of 6 weeks' visitor
activity data. Blue data points are People, orange
data points are Products.
2. Introducing Apache Spark
Revisiting the requirements
Three things we want to do
•  Apache Kafka
•  Apache Flume
•  Files/Sockets
Data Collection
•  Apache Spark
•  Apache Hadoop
•  Storm
Data Processing
•  Apache Cassandra
•  RDBMS
•  MongoDB, etc. etc.
Reporting & Analytics
Why … ?
There are lots of ways to solve it, but here is the best way
Data Collection
•  Distributed
•  Fault-tolerant
•  Scalable
Data Processing
•  Streaming
•  Machine-learning
•  Java/Scala/Python bindings
etc. etc.
Reporting & Analytics
•  Fast random-access to any Row
•  Range-scanning through millions of columns on a single row
Three reasons Apache Spark is awesome!
Apart from “no more Java Map/Reduce code!!!”
•  In-memory Caching
•  DAG execution optimisation
•  Easy to use in Scala, Java, Python
Fast
•  Machine Learning baked in
•  Graph algorithms
•  Interactive Shell
Smart
•  Query from Spark SQL
•  Streaming
•  Batch (file based)
Flexible
Apache Spark
Architecture Overview
[Stack diagram: Apache Spark running over Yarn / Mesos (optional), the Hadoop Filesystem (HDFS) and Apache ZooKeeper.]
Apache Spark
Coexists with your existing Hadoop Infrastructure
[Stack diagram: Apache Spark alongside Map/Reduce, Apache Hive etc., on Yarn / Mesos, over the Hadoop Filesystem (HDFS) and Apache ZooKeeper.]
Apache Spark can …
Simple example of Spark SQL used from Scala
Source: Databricks
Go from a SQL query…
… to a trained machine learning
model in three lines of code.
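The Databricks snippet itself is a screenshot that isn't captured in this transcript. As a stand-in for the "SQL query to trained model" flow it describes, here is a self-contained sketch using sqlite3 and a hand-rolled one-parameter least-squares fit; all table names, columns and data are invented for illustration:

```python
import sqlite3

# Query the training data out of a SQL table...
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
rows = db.execute("SELECT ad_spend, revenue FROM sales").fetchall()

# ...then fit a linear model through the origin: revenue ~ w * ad_spend
xs, ys = zip(*rows)
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
# w == 2.0 for this toy data
```

In Spark the same flow swaps sqlite3 for a Spark SQL query and the hand-rolled fit for an MLlib algorithm, but the shape of the code (query, extract features, train) is the point of the slide.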
3. Example Use Case
Example Architecture
Coexists with your existing Hadoop Infrastructure
[Architecture diagram: Apache Kafka feeds Spark analytics jobs; results land in a NoSQL store (Cassandra) behind a reporting dashboard; everything coexists with Apache ZooKeeper and the Hadoop Filesystem (HDFS).]
Spark Streaming
Processing DStreams
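The code for this slide is an image not captured in the transcript. Conceptually, a DStream is a sequence of micro-batches, each reduced by key and merged into running totals; a plain-Python simulation of that loop (no Spark required, data invented):

```python
from collections import Counter

# Each micro-batch is the set of (key, count) events from one interval
micro_batches = [
    [("clicks", 1), ("clicks", 1), ("views", 1)],   # interval t0
    [("clicks", 1), ("views", 1), ("views", 1)],    # interval t1
]

totals = Counter()
for batch in micro_batches:
    # Per-batch "reduceByKey" step, as a Spark Streaming job would do
    batch_counts = Counter()
    for key, n in batch:
        batch_counts[key] += n
    # Merge into running totals (a real job would write these to Cassandra)
    totals.update(batch_counts)
```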
Cassandra Schema
For storing time-series data
Use a Compound Key:
•  metric name e.g. “Clicks”
•  metric grain e.g. “M” – minutely
•  metric dimensions e.g. “device=mobile&gender=male”
•  timestamp e.g. “2015-01-29 14:30:00.000” (bucketed)
Storing the value:
•  counters – work well in some cases, have limitations (no reset)
•  integers – if in doubt, just use integers (bigint)
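The compound-key scheme above can be sketched as a helper that buckets a timestamp to the metric grain and assembles the row key. Function names, the `|` separator and the grain codes beyond "M" are my assumptions, not from the deck:

```python
from datetime import datetime

def bucket(ts: datetime, grain: str) -> datetime:
    """Truncate a timestamp to the grain: 'M' minutely, 'H' hourly, 'D' daily."""
    if grain == "M":
        return ts.replace(second=0, microsecond=0)
    if grain == "H":
        return ts.replace(minute=0, second=0, microsecond=0)
    if grain == "D":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown grain: {grain}")

def row_key(name: str, grain: str, dimensions: str, ts: datetime) -> str:
    """Compound key: metric name, grain, dimensions, bucketed timestamp."""
    return f"{name}|{grain}|{dimensions}|{bucket(ts, grain).isoformat()}"

ts = datetime(2015, 1, 29, 14, 30, 17, 123000)
key = row_key("Clicks", "M", "device=mobile&gender=male", ts)
# "Clicks|M|device=mobile&gender=male|2015-01-29T14:30:00"
```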
How Cassandra Stores the Data
For storing time-series data
•  Uses one row per ‘compound key’ (name,grain,dimension,time_bucket)
•  Time-series data is stored in the columns of this row
•  Use TTL support to expire old fine-grained data e.g. “minutely expires
after 30 days”, “hourly expires after 90 days”, “daily kept forever”
Source: planetcassandra.org
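The tiered-expiry policy above reduces to a grain-to-TTL lookup, which a writer would pass to Cassandra's `USING TTL` clause on insert. The TTL values come from the slide; the helper itself is illustrative:

```python
DAY = 24 * 60 * 60  # seconds

# TTL per metric grain, in seconds; None means "keep forever"
TTL_BY_GRAIN = {
    "M": 30 * DAY,   # minutely expires after 30 days
    "H": 90 * DAY,   # hourly expires after 90 days
    "D": None,       # daily kept forever
}

def ttl_for(grain: str):
    return TTL_BY_GRAIN[grain]
```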
Social Media Analysis
Converting a low-level event into a meaningful high-level interaction
•  A user-interaction from the
Facebook firehose, received as a
real-time stream of JSON
•  Streamed into Apache Kafka,
also stored in SequenceFiles
•  Modeled into Scala Case Class:
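The Scala case class itself is shown as an image in the original deck. The same modelling step in Python looks like this; the field names are invented for illustration, since the real firehose schema isn't in the transcript:

```python
import json
from dataclasses import dataclass

@dataclass
class SocialInteraction:
    """One user-interaction parsed out of the JSON stream."""
    user_id: str
    action: str
    target: str
    timestamp: str

raw = ('{"user_id": "u42", "action": "like", '
       '"target": "page:acme", "timestamp": "2015-01-29T14:30:00Z"}')
event = SocialInteraction(**json.loads(raw))
```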
Example - Spark SQL
Using the Spark SQL interface to analyze the data
•  Parse JSON
•  Extract interesting attributes,
transform into Case Classes
•  ‘Register as table’
•  Execute SQL, print results
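The Spark SQL code for this slide is an image in the original deck. The same four steps, with sqlite3 standing in for Spark SQL's 'register as table' and query execution (event shape and counts are invented):

```python
import json
import sqlite3

raw_events = [
    '{"user": "u1", "action": "like"}',
    '{"user": "u2", "action": "like"}',
    '{"user": "u1", "action": "share"}',
]

# Parse JSON and extract the interesting attributes
rows = [(e["user"], e["action"]) for e in map(json.loads, raw_events)]

# "Register as table", then execute SQL and print the results
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE interactions (user TEXT, action TEXT)")
db.executemany("INSERT INTO interactions VALUES (?, ?)", rows)
counts = db.execute(
    "SELECT action, COUNT(*) FROM interactions GROUP BY action ORDER BY action"
).fetchall()
print(counts)
```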
Example - Spark (Scala)
Using the Spark (Scala) interface to analyze the data
•  Parse JSON
•  Extract interesting attributes
•  ‘Reduce by Key’ to sum the result
•  Print results
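The same analysis without SQL, mirroring the 'reduce by key' step: emit (key, 1) pairs and sum per key, which is what Spark's `reduceByKey(_ + _)` does across a cluster (sketched here with a plain dictionary and invented data):

```python
import json

raw_events = [
    '{"user": "u1", "action": "like"}',
    '{"user": "u2", "action": "like"}',
    '{"user": "u1", "action": "share"}',
]

# Parse JSON, extract (key, 1) pairs, then reduce by key to sum
pairs = [(json.loads(e)["action"], 1) for e in raw_events]
totals = {}
for key, n in pairs:
    totals[key] = totals.get(key, 0) + n

print(totals)
```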
Thank you!
Any questions?

Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, London 2015-01-29)
