
Big data solutions for advanced marketing analytics

Our retail banking market demands, now more than ever, that we stay close to our customers and carefully understand which services, products, and wishes are relevant for each customer at any given time. This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers to create and scale distributed analyses on a big data platform. By using Hadoop and open source statistical languages and tools such as R and Python, we can execute a variety of machine learning algorithms and scale them out on a distributed computing framework.

Published in: Technology, Education
Big data solutions for advanced marketing analytics

  1. Big Data Solutions for Marketing Analytics. Natalino Busa, @natalinobusa
  2. Parallelism, Hadoop, Cassandra, Akka, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing, Scala, Spray. Natalino Busa, @natalinobusa, www.natalinobusa.com
  3. Humanize Data
  4. The bank statements
  5. The bank statements: how I read the bank bills. Back to routine. Grocery, broken washing machine. After-vacation fun. Pancake house. Traveling back. Just back home. Pizza. Shopping in Sicily. Vacation!
  6. The bank statements: how I read the bank bills, and what happened those days. Back to routine. Grocery, broken washing machine. After-vacation fun. Pancake house. Traveling back. Just back home. Pizza. Shopping in Sicily. Vacation!
  7. Data is the fabric of our lives. Let’s give more meaning and context to data.
  8. Abraham Harold Maslow (April 1, 1908 – June 8, 1970) was an American psychologist who was best known for creating Maslow's hierarchy of needs.
  9. Very human needs. Physiology: breathing, food, water, sleep. Contractual: security of body, resources, health, employment, property. Love & Caring: friend, family, partner, security of love and belonging. Esteem: self-esteem, confidence, achievements, respect. Self-actualization: spontaneity, creativity, acceptance, freedom, ethics.
  10. How caring can technology be?
  11. Technology is reaching out. Physiology: connectivity, electricity, hardware / infra. Contractual: security of basic operations, REST APIs, encryption, authentication. Love & Caring: notification, alerts, social bonding, predictions. Esteem: set goals, planning, achievements, advisory role. Self-actualization: freedom, trusted companion.
  12. Data science top 3: Dimensionality Reduction, Predictive Analytics, Clustering and Segmentation
  13. Data science: what’s working? Random Forests, Artificial Neural Networks, Clustering Algorithms, Pattern Recognition, Time-Series Analysis, Regression. Most models used in practice are a combination of these.
  14. Data science ^.^/ Keep it scientific: cross-validate your models, keep it measurable. Play with it: create new features, explore the available data.
  15. How to code data science?
  16. Data Science: R
      # Multiple Linear Regression Example
      fit <- lm(y ~ x1 + x2 + x3, data=mydata)
      summary(fit) # show results
      ● Language for statistics
      ● Easy to analyze and shape data
      ● Advanced statistical packages
      ● Fueled by academia and professionals
      ● Very clean visualization packages
      Packages for machine learning: time-series forecasting, clustering, classification, decision trees, neural networks
      Remote procedure calls (RPC): from Scala/Java via RProcess and Rserve
  17. Data Science: Python
      >>> from sklearn.datasets import load_iris
      >>> from sklearn import tree
      >>> iris = load_iris()
      >>> clf = tree.DecisionTreeClassifier()
      >>> clf = clf.fit(iris.data, iris.target)
      ● Flexible, concise language
      ● Quick to code and prototype
      ● Portable, visualization libraries
      Machine learning libraries: scipy, statsmodels, sklearn, matplotlib, ipython
      Web libraries: flask, tornado, (no)SQL clients
  18. Earn the trust
  19. The customer’s context. Personal history: amount of transactions ever done. Long-term interaction: how the users’ actions correlate with others. Real-time events: trends and recent events.
  20. The customer’s context. Context is related to time. Slow changing: the defining characteristics of a person. Fast changing: events which influence our lives, trends. These require very different technology solutions!
  21. Challenges. Not much time to react: events must be delivered fast to the new machine APIs; it’s Web and mobile apps, so the latency budget is limited. Loads of information to process: understand the user history well; access a larger context.
  22. Big Data and Fast Data: ranking and preference; segmentation and clustering; short-term trending topics; rule-based recommendations. Tens of terabytes of data: this can take hours. Hundreds of events per second: this must be fast.
  23. Back to the drawing board
  24. A high-level bank schematic: core banking systems; SOAP services and DBs; system bus; customer-facing apps; channels
  25. Higher separation! Fewer silos. Interactions with core systems. Bigger and faster.
  26. Human-centric applications
  27. Some techs
  28. Hadoop: Distributed Data OS. Reliable: distributed, replicated file system. Low cost: ↓ cost vs ↑ performance/storage. Computing powerhouse: all the cluster's CPUs work in parallel to run queries.
  29. Cassandra: A low-latency 2D store. Reliable: distributed, replicated data store. Low latency: sub-millisecond read/write operations. Tunable CAP: define your level of consistency. Data model: hashed rows, sorted wide columns. Architecture model: no SPOF, ring of nodes, homogeneous system.
  30. Scala / Akka / Spray: a Web API reactive framework. Diagram: Actor A, Actor B, Actor C; msg 1, msg 2, msg 3, msg 4. ● It scales horizontally (can run in cluster mode) ● Maximum use of the available cores/memory ● Processing is non-blocking, threads are re-used ● Can parallelize computing power across many actors. Very fast: thousands of messages/sec. Very reliable: auto recovery. Lazy: compute only when required.
  31. Putting it all together. Diagram labels: Hadoop; application (actor based); millions of; millions of; λ = conversions (lambda); data queues
  32. Science & Engineering. Statistics, Data Science: Python, R, Visualization. IT Infra, Big Data: Java, Scala, SQL. Hadoop: Big Data infrastructure, Data Science on large datasets. Big Data and Fast Data require different profiles to achieve the best results.
  33. Some lessons learned ● Mixing and matching technologies is a good thing ● Fast Data must complement Big Data ● Ease integration among teams ● Hadoop, Cassandra, and Akka ● Data Science takes time to figure out
  34. Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing. Natalino Busa, @natalinobusa, www.natalinobusa.com. Thanks! Any questions?
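
Slide 14 advises cross-validating models, and slide 17 fits a decision tree on the iris dataset. The two can be combined in a minimal sketch, assuming scikit-learn is installed. The five folds and the fixed random seed are illustrative choices, not taken from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# random_state is fixed only to make the run repeatable (an assumption, not from the slides).
clf = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: train on four folds, score accuracy on the held-out fold.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
mean_accuracy = scores.mean()
print(f"accuracy per fold: {scores}, mean: {mean_accuracy:.3f}")
```

Reporting the per-fold scores alongside the mean keeps the result measurable: a large spread between folds suggests the model is sensitive to which data it was trained on.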
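
Slide 29 summarizes the Cassandra data model as "hashed rows, sorted wide columns". The toy sketch below illustrates that idea in plain Python. It is not Cassandra's implementation or API: the class `WideRowStore`, its methods, the modulo-based node choice (Cassandra itself uses a token ring), and the sample transactions are all hypothetical.

```python
import bisect
import hashlib


class WideRowStore:
    """Toy model of a partitioned wide-row store (not Cassandra's actual implementation)."""

    def __init__(self, node_count=3):
        # Each node maps row_key -> list of (column_name, value) pairs sorted by name.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, row_key):
        # Hash the row key to choose a node; md5 keeps the choice stable across runs.
        digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, row_key, column, value):
        columns = self._node_for(row_key).setdefault(row_key, [])
        names = [name for name, _ in columns]
        i = bisect.bisect_left(names, column)
        if i < len(columns) and columns[i][0] == column:
            columns[i] = (column, value)  # overwrite an existing column
        else:
            columns.insert(i, (column, value))  # keep columns sorted by name

    def slice(self, row_key, start, end):
        # Return columns with start <= name < end, in sorted order.
        columns = self._node_for(row_key).get(row_key, [])
        names = [name for name, _ in columns]
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_left(names, end)
        return columns[lo:hi]


# Hypothetical data: one row per customer, one column per transaction date.
store = WideRowStore()
store.put("customer-42", "2014-03-02", "pizza")
store.put("customer-42", "2014-03-01", "travel")
store.put("customer-42", "2014-03-05", "grocery")
recent = store.slice("customer-42", "2014-03-01", "2014-03-03")
```

Because the columns of a row stay sorted, a range query such as "all transactions of one customer between two dates" touches a single row on a single node.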
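
Slide 30 describes actors that exchange messages and process them without blocking. The talk uses Scala and Akka; the sketch below only imitates the pattern with Python's `asyncio`, so it shows the idea rather than the Akka API. The `CounterActor` class, the `None` stop sentinel, and the event names are hypothetical.

```python
import asyncio


class CounterActor:
    """Hypothetical actor: owns its state and handles one message at a time."""

    def __init__(self):
        self.mailbox = asyncio.Queue()
        self.counts = {}

    async def run(self):
        while True:
            message = await self.mailbox.get()
            if message is None:  # sentinel used here to stop the actor
                break
            # State is only touched inside this loop, so no locks are needed.
            self.counts[message] = self.counts.get(message, 0) + 1


async def main():
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    for event in ["login", "payment", "login"]:
        await actor.mailbox.put(event)  # senders only enqueue, they never call the actor
    await actor.mailbox.put(None)
    await task
    return actor.counts


counts = asyncio.run(main())
```

Senders communicate only through the mailbox, which is what lets an actor system spread many such actors across cores or machines.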
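
Slides 22 and 33 argue that Fast Data must complement Big Data: heavy analyses over terabytes take hours, while recent events must be handled quickly. The slides do not spell out how the two are combined. One common approach, assumed here, is to merge a periodically recomputed batch result with counts from recent events at query time. The function names and sample data are hypothetical.

```python
from collections import Counter


def batch_view(history):
    """Slow path: recompute category counts over the full history."""
    return Counter(category for _customer, category in history)


def merged_view(batch, recent_events):
    """Fast path: add events that arrived after the last batch run."""
    view = Counter(batch)  # copy, so the batch result stays untouched
    view.update(category for _customer, category in recent_events)
    return view


# Hypothetical (customer, category) transactions.
history = [("c1", "grocery"), ("c1", "travel"), ("c2", "grocery")]
recent = [("c1", "pizza"), ("c2", "grocery")]

batch = batch_view(history)
view = merged_view(batch, recent)
top_category, top_count = view.most_common(1)[0]
```

The batch result can be hours old without hurting freshness, because the small set of recent events is folded in on every query.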
