Big data solutions for advanced marketing analytics

The retail banking market demands, now more than ever, that we stay close to our customers and carefully understand which services, products, and wishes are relevant for each customer at any given time. This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers to create and scale distributed analyses on a big data platform. By using Hadoop and open source statistical languages and tools such as R and Python, we can execute a variety of machine learning algorithms and scale them out on a distributed computing framework.

Published in: Technology, Education

  1. Big Data Solutions for Marketing Analytics. Natalino Busa, @natalinobusa
  2. Natalino Busa, @natalinobusa, www.natalinobusa.com. Topics: Parallelism, Hadoop, Cassandra, Akka, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing, Scala, Spray
  3. Humanize Data
  4. The bank statements
  5. The bank statements: how I read the bank bills. Back to routine. Grocery, broken washing machine. After-vacation fun. Pancake house. Traveling back. Just back home. Pizza. Shopping in Sicily. Vacation!
  6. The bank statements: how I read the bank bills, and what happened those days. Back to routine. Grocery, broken washing machine. After-vacation fun. Pancake house. Traveling back. Just back home. Pizza. Shopping in Sicily. Vacation!
  7. Data is the fabric of our lives. Let’s give more meaning and context to data.
  8. Abraham Harold Maslow (April 1, 1908 – June 8, 1970) was an American psychologist best known for creating Maslow's hierarchy of needs.
  9. Very human needs. Physiology: breathing, food, water, sleep. Contractual: security of body, resources, health, employment, property. Love & Caring: friends, family, partner, security of love and belonging. Esteem: self-esteem, confidence, achievements, respect. Self-actualization: spontaneity, creativity, acceptance, freedom, ethics.
  10. How caring can technology be?
  11. Technology is reaching out. The same five levels (Physiology, Contractual, Love & Caring, Esteem, Self-actualization) applied to technology: connectivity, electricity, hardware/infra; security of basic operations; REST APIs, encryption, authentication; notifications, alerts, social bonding, predictions; goal setting, planning, achievements, advisory role; freedom, trusted companion.
  12. Data science top 3: Dimensionality Reduction; Predictive Analytics; Clustering and Segmentation
  13. Data science: what’s working? Random Forests, Artificial Neural Networks, Clustering Algorithms, Pattern Recognition, Time-series analysis, Regression. Most actual models are a combination of these.
  14. Data science ^.^/ Keep it scientific; cross-validate your models; keep it measurable; play with it; create new features; explore the available data.
  15. How to code data science?
  16. Data Science: R. ● Language for statistics ● Easy to analyze and shape data ● Advanced statistical packages ● Fueled by academia and professionals ● Very clean visualization packages. Multiple linear regression example: fit <- lm(y ~ x1 + x2 + x3, data=mydata); summary(fit)  # show results. Packages for machine learning: time-series forecasting, clustering, classification, decision trees, neural networks. Remote procedure calls (RPC): from Scala/Java via RProcess and Rserve.
  17. Data Science: Python. ● Flexible, concise language ● Quick to code and prototype ● Portable, visualization libraries. Decision tree example: from sklearn.datasets import load_iris; from sklearn import tree; iris = load_iris(); clf = tree.DecisionTreeClassifier(); clf = clf.fit(iris.data, iris.target). Machine learning libraries: scipy, statsmodels, sklearn, matplotlib, ipython. Web libraries: flask, tornado, (no)SQL clients.
  18. Earn the trust
  19. The customer’s context. Personal history: amount of transactions ever done. Long-term interaction: how the user's actions correlate with those of others. Real-time events: trends and recent events.
  20. The customer’s context is related to time. Slow changing: the defining characteristics of a person. Fast changing: events which influence our lives, trends. These require very different technology solutions!
  21. Challenges. Not much time to react: events must be delivered fast to the new machine APIs; it’s Web and mobile apps, so the latency budget is limited. Loads of information to process: understand the user history well; access a larger context.
  22. Big Data and Fast Data: ranking and preference; segmentation and clustering; short-term trending topics; rule-based recommendations. 10’s of terabytes of data: this can take hours. 100’s of events per second: this must be fast.
  23. Back to the drawing board
  24. A high-level bank schematic: core banking systems; SOAP services and DBs; system bus; customer-facing applications; channels
  25. Higher separation! Fewer silos. Interactions with core systems. Bigger and faster.
  26. Human-centric applications
  27. Some techs
  28. Hadoop: Distributed Data OS. Reliable: distributed, replicated file system. Low cost: ↓ cost vs ↑ performance/storage. Computing powerhouse: all the cluster's CPUs working in parallel for running queries.
  29. Cassandra: a low-latency 2D store. Reliable: distributed, replicated data store. Low latency: sub-millisecond read/write operations. Tunable CAP: define your level of consistency. Data model: hashed rows, sorted wide columns. Architecture model: no single point of failure (SPOF), ring of nodes, homogeneous system.
  30. Scala / Akka / Spray: a reactive web API framework. Diagram: actors A, B, and C exchanging messages 1 to 4. ● It scales horizontally (can run in cluster mode) ● Maximum use of the available cores/memory ● Processing is non-blocking, threads are re-used ● Can parallelize computing power across many actors. Very fast: 1000’s of messages/sec. Very reliable: auto recovery. Lazy: compute only when required.
  31. Putting it all together (diagram labels): Hadoop; application (actor based); data queues; millions of millions of λ (lambda) conversions
  32. Science & Engineering. Statistics, Data Science, Python, R, Visualization; IT Infra, Big Data, Java, Scala, SQL. Hadoop: Big Data infrastructure, Data Science on large datasets. Big Data and Fast Data require different profiles to achieve the best results.
  33. Some lessons learned ● Mixing and matching technologies is a good thing ● Fast Data must complement Big Data ● Ease integration among teams ● Hadoop, Cassandra, and Akka ● Data Science takes time to figure out
  34. Thanks! Any questions? Natalino Busa, @natalinobusa, www.natalinobusa.com. Topics: Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing
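
Slide 22 contrasts batch work on terabytes of data with "short-term trending topics" computed over hundreds of events per second. The talk does not show how the fast path is implemented; the following is only a minimal Python sketch of one common approach, a sliding time window of counts. The class name `TrendingWindow`, the 60-second window, and the sample topics are all hypothetical and chosen for illustration.

```python
from collections import Counter, deque


class TrendingWindow:
    """Counts the topics seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic) pairs, oldest first
        self.counts = Counter()  # topic -> occurrences inside the window

    def _expire(self, now):
        # Drop events that are at least `window_seconds` older than `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def add(self, timestamp, topic):
        # Assumption: events arrive in non-decreasing timestamp order.
        self._expire(timestamp)
        self.events.append((timestamp, topic))
        self.counts[topic] += 1

    def top(self, n, now):
        # Return up to n (topic, count) pairs for the current window.
        self._expire(now)
        return self.counts.most_common(n)


# Hypothetical event stream: (seconds since start, topic).
window = TrendingWindow(window_seconds=60)
for ts, topic in [(0, "groceries"), (10, "travel"), (20, "travel"), (70, "pizza")]:
    window.add(ts, topic)
trending = window.top(2, now=70)
```

Each event is appended and expired once, so the cost per event stays small and does not grow with the total history, which is the property a limited latency budget needs. The long-term history (ranking, segmentation) would still be computed in batch, as the slide suggests.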

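
Slide 29 summarizes the Cassandra data model as "hashed rows, sorted wide columns" on a ring of nodes. As a rough illustration of that idea only, here is a toy in-memory model in Python. It is not Cassandra's API or its actual partitioning scheme: real clusters use token ranges and replication, while this sketch simply takes a hash modulo the node count. The class `Ring`, the node names, and the sample rows are hypothetical.

```python
import bisect
import hashlib


class Ring:
    """Toy ring of nodes: the row key is hashed to pick the owning node."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Per node: row key -> list of (column, value) pairs sorted by column.
        self.rows = {name: {} for name in self.node_names}

    def node_for(self, row_key):
        # Simplification: hash modulo node count instead of token ranges.
        digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, row_key, column, value):
        row = self.rows[self.node_for(row_key)].setdefault(row_key, [])
        names = [name for name, _ in row]
        i = bisect.bisect_left(names, column)
        if i < len(row) and row[i][0] == column:
            row[i] = (column, value)         # overwrite an existing column
        else:
            row.insert(i, (column, value))   # keep columns sorted by name

    def slice(self, row_key, start, end):
        # Range read over the sorted columns, inclusive on both ends.
        row = self.rows[self.node_for(row_key)].get(row_key, [])
        names = [name for name, _ in row]
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, end)
        return row[lo:hi]


# Hypothetical usage: one wide row per customer, one column per day.
ring = Ring(["node-a", "node-b", "node-c"])
ring.put("customer-42", "day-03", "pizza")
ring.put("customer-42", "day-01", "shopping")
ring.put("customer-42", "day-05", "groceries")
rows = ring.slice("customer-42", "day-01", "day-03")
```

Hashing the row key spreads customers across nodes, while keeping each row's columns sorted makes a time-range read for one customer a single contiguous slice. That combination is what makes this kind of store a plausible fit for the low-latency lookups of customer context discussed in slides 19 to 21.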