LUISS - Deep Learning and data analyses - 09/01/19

Data Analysis, Mobility,
Proximity and App-based
Marketing
Deep Learning and data analyses
A new perspective
on how data support companies
on strategic decisions.
Presenter
Alberto Paro
Date 09/01/19

Ø Master Degree in Computer Science Engineering at Politecnico di
Milano
Ø Author of 3 books about ElasticSearch from 1 to 5.x + 6 Tech
reviews
Ø Big Data Trainer, Developer and Consultant on Big Data
Technologies (Akka, Play Framework, Apache Spark, Reactive
Programming), NoSQL (Accumulo, HBase, Cassandra,
ElasticSearch, Kafka and MongoDB) and Machine Learning
applied to Big Data.
Ø Evangelist for the Scala and Scala.js languages
ABOUT ME – ALBERTO PARO

Ø Big Data Concepts
Ø Market position
Ø Build a Solution for
Intelligence
TOPICS

The ‘Datafication’
Ø Activity
Ø Conversation
Ø Text
Ø Voice
Ø Social Media
Ø Browser logs
Ø Photos
Ø Videos
Ø IOT
Ø Etc.
Volume
Veracity
Variety
Velocity
Big Data Analysis:
Ø Text analytics
Ø Sentiment analysis
Ø Face recognition
Ø Voice analytics
Ø Movement analytics
Ø Etc.
Value
TRANSFORM BIG DATA INTO VALUE

ARTIFICIAL INTELLIGENCE/MACHINE LEARNING

ARTIFICIAL INTELLIGENCE: AI
The ability of a
machine to replicate
intelligent human
behavior

MACHINE LEARNING
The ability to improve
performance of a task
progressively,
without being
explicitly
programmed

4 BIG IDEAS
Data Driven Decision Making
Cloud Computing
Machine Learning
Cognitive
Computing:
ML + BigData + NLP

OUTLOOK
Worldwide Spending on
Cognitive and Artificial
Intelligence Systems reached
about $19.1 Billion in 2018
Source: IDC
40% of Digital Transformation
initiatives will use AI services.
AI spending will grow to $42.2
Billion in 2021.
Source: IDC

AI SPENDING
Retail $3.4B
Banking $3.3B
Manufacturing $2B
Healthcare $1.7B

Ø Customer Recommendation
Ø Customer Profiling
Ø Customer Pre-Selling
Ø Customer Post-Selling
Ø Fraud Detection
Ø Prediction Systems for Brokers (banking/finance)
AI TECHNOLOGIES – RETAIL AND BANKING

Ø Cost reduction via robots
Ø Creation of new products
Ø Quality monitoring
Ø Learning by example
Ø Predictive Maintenance
AI TECHNOLOGIES – MANUFACTURING

WHERE ARE COMPANIES SPENDING?
Ø Cloud
Ø Data Science Teams
Ø ML Tools
Ø Deep Learning
Ø Microservices Mesh
Ø Lots of Proof-of-Concepts

TRACTION
Early Adopters:
Ø RPA (e-discovery, QA)
Ø Customer Service (chatbots)
Ø Marketing (sales lead automation)
Ø Behaviour Design (captology)
Majority:
Ø SPAM filtering
Ø Business Analytics
Ø Risk Scoring (insurance, banking,
credit card)

TRACTION
Source: Grand View Research

INNOVATION
Sales (people) Automation

INNOVATION
NLP in (News) Media

INNOVATION
Automate Voice Customer Service

Cognitive computers are:
Ø Made with algorithms
Ø Knowledgeable ONLY about what they are taught
Ø In control ONLY of what we give them control of
Ø Aware of nuances and able to continue to learn
more
Cognitive algorithms can:
Ø Do very boring work for you
Ø Often make better, more consistent decisions
than humans
Ø Be efficient; they won’t get tired
DATA SCIENTIST TEAM

Machine Learning is mathematics / statistics
Ø Linear algebra
Ø Calculus
Ø Probability theory
Ø Graph theory
Ø …
Hardly anyone knows all of this.
It's a big field with lots of theory.
It has two orthogonal aspects:
Ø Analytics / machine learning
Ø Big data
Ø They can be combined or used separately
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
DATA SCIENTIST TEAM – SKILLS 1/3

DATA SCIENTIST TEAM – SKILLS 2/3

Ø Applications in the year 2000
Ø Dozens of servers
Ø Response times on the order of seconds
Ø Hours of maintenance downtime
Ø Data on the order of gigabytes
Ø Accessed by desktop devices
Ø Modern applications (2010 onwards)
Ø Clusters of thousands of multicore processors
Ø Response times on the order of milliseconds
Ø 100% uptime
Ø Data on the order of petabytes
Ø Accessed by any device
APPLICATION EVOLUTION

Ø Reactive application
Ø Event-driven
Ø Scalable
Ø Resilient/Elastic
Ø Responsive
Ø React to
Ø Events: the event-driven nature enables the other qualities
Ø Failure: resilient systems can recover from errors
at all levels
Ø Load: Scalability does not depend on shared resources
Ø Users: Response times should not depend on workload
APPLICATION EVOLUTION: NEW REQUIREMENTS

Ø No redesign needed to achieve
scalability
Ø Scalability on demand
Ø Risk management
Ø Real-time, engaging,
rich, collaborative
Ø No latency in responses
Ø Loosely coupled design
Ø Communication orientation
Ø Efficient use of resources
Ø Downtime is a waste
of money
Ø Part of the design
REACTIVE MANIFESTO

APPLICATION/INFRASTRUCTURE EVOLUTION

Ø The flow of data is always the same.
Ø Each BigData project has the following steps
DATA FLOW

BIG DATA – MICRO STANDARD ARCHITECTURE

Ø A single master data store that
contains all the knowledge
Ø Each application receives
data from the main data lake,
but is then
independent for the analysis
Ø Allows you to crop /
anonymize / mask data for
the data scientists
BIG DATA – MULTI TENANT/INTELLIGENCE ARCHITECTURE

MACHINE LEARNING ARCHITECTURE FLOW

Ø Any database that is not a “relational database”
Ø The term was coined during a meet-up
Ø “Non-relational Databases”
Ø Not Only SQL
NOSQL

NoSQL
Key-Value
Ø Redis
Ø Voldemort
Ø Dynomite
Ø Tokio*
BigTable Clones
Ø Accumulo
Ø HBase
Ø Cassandra
Document
Ø CouchDB
Ø MongoDB
Ø ElasticSearch
GraphDB
Ø Neo4j
Ø OrientDB
Ø …Graph
Message Queue
Ø Kafka
Ø RabbitMQ
Ø ...MQ
500K ops/s per node
15K ops/s per node
Petabytes of data
300K ev/s for no
Complex Data
NOSQL

Most common tools:
Ø Apache Kafka
Ø RabbitMQ
Ø Apache ActiveMQ
Ø Redis
Ø Producers publish messages to “topics / queues”
Ø They serve the messages to the consumers
Ø Essential for back-pressure
Ø They have very little functionality: no queries
NOSQL – MESSAGE QUEUE 1/2
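The role of a queue in back-pressure can be illustrated with a minimal in-process sketch. It does not use Kafka or RabbitMQ: it relies only on Python's standard `queue.Queue` with a bounded size, so the producer blocks whenever the consumer falls behind. The function name `run_pipeline` and the message format are assumptions made for this illustration.

```python
import queue
import threading


def run_pipeline(n_messages, capacity):
    """Send n_messages through a bounded queue and return what the consumer saw."""
    buffer = queue.Queue(maxsize=capacity)  # bounded: put() blocks when full
    received = []

    def producer():
        for i in range(n_messages):
            # Blocks while the queue is full: this is the back-pressure.
            buffer.put(f"event-{i}")
        buffer.put(None)  # sentinel: no more messages

    def consumer():
        while True:
            message = buffer.get()
            if message is None:
                break
            received.append(message)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received


messages = run_pipeline(n_messages=100, capacity=5)
```

With one producer and one consumer the messages arrive in the order they were published, and at most `capacity` messages are ever waiting in the queue.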

Ø Initially developed by LinkedIn and made
open-source in 2011
Ø Apache project since October 23, 2012
Ø In 2014 Confluent was founded by former
LinkedIn developers to provide business
support
Ø Widely adopted in enterprise-level projects and
infrastructure.
Ø Performance scales linearly with the number
of nodes
NOSQL – APACHE KAFKA

Ø Designed to store very large
data sets (several petabytes)
Ø The top of the market is dominated
by Apache HBase and
Accumulo
Ø Insert throughput depends
only on the number of nodes
Ø They offer functionality to
extend them
NOSQL – COLUMNAR DATABASES 1/3

Ø 115M entries/s
Ø 216 Nodes
Ø 1296 Ingestion processes

Ø Full-text search engine
Ø Based on Lucene, written in Java 8
Ø “Distributed, (Near) Real Time, Search Engine”
Ø RESTful JSON over HTTP: easy to debug
Ø Schema-free
Ø Dynamic mapping
Ø Multi-tenant
Ø Scalable
Ø From 1 node to thousands of nodes
Ø Highly available
Ø Rich set of search functions
Ø Built-in analytics
Ø Open source (Apache 2.0)
Ø Originally written by Shay Banon (Kimchy)
Ø Easy to install
ELASTICSEARCH
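Because the interface is JSON over HTTP, a search combined with analytics is just a JSON document. The sketch below only builds such a request body using the query DSL (`match` query plus a `terms` aggregation); it does not contact a cluster. The field names `title` and `author.keyword` and the helper `build_search_body` are hypothetical and chosen for illustration.

```python
import json


def build_search_body(text, size=10):
    """Build a search request: full-text match plus a per-author count."""
    return {
        "size": size,
        # Full-text search on a (hypothetical) analyzed text field.
        "query": {"match": {"title": text}},
        # Built-in analytics: count matching documents per author.
        "aggs": {"by_author": {"terms": {"field": "author.keyword"}}},
    }


body = build_search_body("big data")
payload = json.dumps(body)
```

The resulting `payload` is what would be sent in a POST request to the `_search` endpoint of an index.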

Ø Near-real-time analytics in milliseconds
Ø Advanced analytics
Ø Your company's own “Google” engine
Ø New approach to the business
Ø Short time from data gathering to results
Ø A few low-cost servers can process in
milliseconds more data than a big
Hadoop cluster or a very expensive DBMS
solution
WHY ELASTICSEARCH?

Ø Clustering
Ø Association learning
Ø Parameter estimation
Ø Recommendation engines
Ø Classification
Ø Similarity matching
Ø Neural networks
Ø Bayesian networks
Ø Genetic algorithms
MACHINE LEARNING - ALGORITHM

Ø Traditional databases can also handle Big Data.
Ø NoSQL databases have poor analytics (except
Elasticsearch)
Ø MapReduce often works on text files
Ø It can also work on data from SQL and NoSQL
Ø NoSQL allows greater throughput
Ø In general, you may have a mix of sources
Ø Text files, NoSQL and SQL
MACHINE LEARNING – NOSQL AND BIGDATA
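As a reminder of what MapReduce does on text files, here is a minimal single-process sketch of the classic word count. It is not a Hadoop or Spark job; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions used only to make the three phases visible.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is value"])
```

In a real cluster the map and reduce calls run in parallel on different nodes; the logic per record stays the same.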

Ø One of the biggest problems
Ø Manually entered data is “suspicious”
Ø Many datasets are profoundly problematic
Ø Sometimes recovering data is problematic:
Ø Systematic problems with sensors
Ø Errors that cause data loss
Ø Incorrect metadata on sensors
Ø Never, ever, believe the data without checking!
Ø Garbage in, garbage out, etc => SIZE
MACHINE LEARNING – DATA QUALITY
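Checking the data before trusting it can start with very simple rules. The sketch below assumes a list of sensor readings where a missing value is represented as `None` and where the plausible range is known; the function `check_readings`, the sample values and the range are all hypothetical.

```python
def check_readings(readings, low, high):
    """Split readings into valid values and a list of (index, reason) problems."""
    valid, problems = [], []
    for index, value in enumerate(readings):
        if value is None:
            problems.append((index, "missing"))
        elif not (low <= value <= high):
            problems.append((index, "out of range"))
        else:
            valid.append(value)
    return valid, problems


# Hypothetical temperature readings: one lost value and one sensor error code.
valid, problems = check_readings([21.5, None, 22.0, -999.0, 23.1], low=-40.0, high=60.0)
```

Only the values in `valid` would be passed on to a learning algorithm; `problems` tells you how much of the dataset is suspicious and why.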

Ø Supervised
Ø We have a training dataset with the correct answers
Ø We use the training data to instruct the algorithm
Ø Then we apply the trained algorithm to data without answers
Ø Unsupervised
Ø There is no training data
Ø The data is fed into the algorithm in the hope that it
makes sense of the data
Ø The data scientist can then interpret the results
MACHINE LEARNING – TYPES

Ø Predictive
Ø They predict a variable from the data
Ø Classification
Ø They assign records to predefined groups
Ø Clustering
Ø It divides records into similarity-based groups
Ø Associative Learning
Ø It evaluates relationships between records: “What Happens With
What”
MACHINE LEARNING – TYPES

ØThere is noise in the data
ØInput data is inaccurate
ØThere are hidden / latent values
ØInductive bias
ØEssentially the shape of the algorithm we
choose
ØNot all data can "fit”
ØIntroducing underfitting or overfitting
ØMachine Learning without Bias is not possible.
MACHINE LEARNING – PROBLEMS

ØTesting is essential
ØTesting means splitting data into 2 datasets:
ØTraining data (input for algorithms)
ØTest data (used for evaluation)
ØPerformance measures have to be calculated
ØPrecision / Recall
ØMean squared error
MACHINE LEARNING – PROBLEMS
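The testing steps on the slide above can be sketched in a few lines of plain Python: split the data, then compute precision/recall for a classifier or the mean squared error for a numeric prediction. The function names, the 25% test fraction and the sample values are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into a training set and a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


train, test = train_test_split(range(20))
precision, recall = precision_recall([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
```

The model is fitted only on `train`; the measures are computed only on `test`, so they estimate how the model behaves on data it has not seen.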

1. C4.5
2. k-means clustering
3. Support vector machines
4. The Apriori algorithm
5. The EM algorithm
6. PageRank
7. AdaBoost
8. k-nearest neighbors classification
9. Naïve Bayes
10. CART
MACHINE LEARNING – TOP 10 ALGORITHMS

Ø Algorithm to build decision trees
Ø Essentially a tree of Boolean expressions
Ø Each node divides the data into 2
Ø Leaves associate objects with classes
Ø Decision trees are useful not only for classification
Ø They also teach us a lot about the classes
Ø C4.5 is a bit complex to learn
Ø The ID3 algorithm is much simpler
Ø CART (#10) is another algorithm for learning decision trees
MACHINE LEARNING – C4.5
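To make the idea of "each node divides the data" concrete, here is a sketch of how the simpler ID3 algorithm chooses the attribute to test at a node: it picks the one with the highest information gain. This is not C4.5 itself (which, among other refinements, uses a gain ratio); the tiny dataset and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Attribute that a greedy tree learner would test at this node."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["play", "play", "stay", "stay"]
root = best_attribute(rows, labels)
```

Here `outlook` separates the two classes perfectly while `windy` carries no information, so `outlook` becomes the root test. Repeating the same choice on each resulting subset builds the whole tree.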

Ø It is a way to perform binary classification on data represented as vectors
Ø Support vectors are the data points closest to the hyperplane
dividing the classes
Ø SVM maximizes the margin, i.e. the distance between the support vectors
(SV) and the separating hyperplane.
MACHINE LEARNING – SUPPORT VECTOR MACHINES
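The quantities named on the slide above can be computed directly once a hyperplane is given. The sketch below does not train an SVM; it only shows, for a hypothetical hyperplane `w·x + b = 0` and a handful of made-up points, how the distance to the hyperplane, the margin and the predicted class are obtained.

```python
import math


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Distance from the hyperplane to the closest point (a support vector)."""
    return min(abs(signed_distance(w, b, x)) for x in points)


def classify(w, b, x):
    """The class is decided by the side of the hyperplane the point lies on."""
    return 1 if signed_distance(w, b, x) >= 0 else -1


# Hypothetical hyperplane: the vertical line x1 = 2 in a 2D space.
w, b = [1.0, 0.0], -2.0
points = [[0.0, 1.0], [1.0, 3.0], [3.0, 0.0], [5.0, 2.0]]
m = margin(w, b, points)
```

Training an SVM means searching for the `w` and `b` that make this margin as large as possible while keeping the two classes on opposite sides.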

Ø It is an algorithm for “frequent itemsets”
Ø Essentially it extracts which items appear frequently
together
Ø For example, which products are bought together at
the supermarket?
Ø Used by Amazon: “Customers who bought this also
bought ...”
Ø It can also be used to create association rules
Ø Apriori is slow
MACHINE LEARNING – APRIORI ALGORITHM
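A minimal sketch of the level-wise search follows: count single items, keep the frequent ones, join them into larger candidates, and prune every candidate that has an infrequent subset. The baskets and the `min_support` threshold are made up for illustration, and the implementation favours readability over speed.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Count in how many transactions each candidate itemset appears.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every subset of a frequent itemset must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
frequent = apriori(baskets, min_support=2)
```

With these baskets, bread and milk (and bread and butter) are bought together at least twice, while milk and butter are not, so the three-item set is pruned without being counted.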

Ø Used in various contexts
Ø Very difficult to understand what it does
Ø Mathematically very heavy
Ø It is an iterative algorithm
Ø It alternates between an “expectation” step and a “maximization” step
Ø It tries to optimize the output of a function
Ø It can be used for clustering
MACHINE LEARNING – EXPECTATION MAXIMIZATION
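One common use of EM for clustering is fitting a mixture of Gaussians. The sketch below handles the simplest case, two Gaussians on one-dimensional data, so the two alternating steps are easy to follow. The initialisation (means at the extremes of the data, unit variances), the iteration count and the sample data are assumptions made for this example.

```python
import math


def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    # Crude initial guess: put the two means at the extremes of the data.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E step: how responsible is each component for each point?
        resp = []
        for x in data:
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M step: re-estimate the parameters from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)  # keep the variance strictly positive
    return means, variances, weights


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_two_gaussians(data)
```

After a few iterations the two means settle near the centres of the two groups of points, which is the clustering result.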

Ø It is a graph algorithm
Ø It determines the most important nodes
Ø It is used by Google to weight the results
Ø It can be applied to any graph
Ø For example, RDF data
Ø It works by simulating random paths
Ø Estimating the probability of being at a given node at a given
time
Ø Implemented with linear algebra
MACHINE LEARNING – PAGERANK
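The random-path idea can be sketched with power iteration: repeatedly redistribute each node's score along its outgoing links until the scores stop changing. The damping factor 0.85 is the commonly quoted value, while the four-node graph and the function name `pagerank` are made up for this example. Every link target must also appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a graph given as {node: [nodes it links to]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random jump: with probability (1 - damping) restart at any node.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                # Follow one of the outgoing links with equal probability.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

Node C, which receives links from three other nodes, ends up with the highest score, and D, which nobody links to, with the lowest. The same loop written with matrices is a repeated matrix-vector product, which is why the algorithm is usually implemented with linear algebra.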

Ø Algorithm for “ensemble learning”
Ø It combines several algorithms
Ø They run on the same data
Ø The combination of multiple algorithms can be very
effective
Ø Better than just one algorithm
Ø AdaBoost essentially weights the training samples
Ø Giving more weight to those that are classified worse
MACHINE LEARNING – ADABOOST
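The re-weighting described on the slide above can be sketched with the simplest possible weak learner, a one-dimensional threshold rule (a "decision stump"). Labels are +1 and -1; the dataset, the number of rounds and all function names are assumptions for illustration.

```python
import math


def stump_predict(threshold, polarity, x):
    """Weak learner: predict `polarity` below the threshold, the opposite above."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Pick the stump with the lowest weighted error on the training samples."""
    best = None
    for threshold in sorted(set(xs)) + [max(xs) + 1]:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this weak learner
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
```

No single threshold can separate this data, because the negative samples sit in the middle, yet the weighted vote of three stumps classifies every training sample correctly.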

Ø Given a group of items
Ø Movies, books, ...
Ø You have user ratings
Ø 1-5 stars, 1-10, ...
Ø It can be used to recommend items to a user depending on
other people's scores
Ø For this reason it is called collaborative filtering
MACHINE LEARNING – COLLABORATIVE FILTERING
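A minimal user-based sketch: to guess how a user would rate an item, average the ratings other users gave it, weighted by how similar their tastes are to the user's. Cosine similarity over the commonly rated items is one of several possible similarity measures; the ratings, names and function names below are made up.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two users, computed on the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else None


ratings = {
    "ann": {"Matrix": 5, "Titanic": 1},
    "bob": {"Matrix": 5, "Titanic": 1, "Alien": 5},
    "eve": {"Matrix": 1, "Titanic": 5, "Alien": 1},
}
prediction = predict_rating(ratings, "ann", "Alien")
```

Ann's ratings match Bob's exactly and differ from Eve's, so the predicted rating for "Alien" lands much closer to Bob's 5 than to Eve's 1.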

Ø A theorem for combining probabilities
Ø I observed A, which indicates that H is true with probability
70%
Ø I observed B, which indicates that H is true with probability
85%
Ø What can I conclude?
Ø Bayes' theorem gives the answer
Ø With the assumption that A and B are independent
Ø This assumption is almost always false, hence “naive”
MACHINE LEARNING – NAÏVE BAYES
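The 70% / 85% question can be worked out in a few lines. The sketch assumes that A and B are independent given H (the "naive" assumption) and, as an extra assumption not stated on the slide, that H is as likely as not before any observation (`prior=0.5`). The function name `combine` is hypothetical.

```python
def combine(probabilities, prior=0.5):
    """Combine independent pieces of evidence about H using Bayes' theorem."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds
    for p in probabilities:
        # Each observation multiplies the odds by its likelihood ratio,
        # i.e. its own odds divided by the prior odds.
        odds *= (p / (1 - p)) / prior_odds
    return odds / (1 + odds)


combined = combine([0.70, 0.85])
```

Under these assumptions the two observations together give H a probability of about 93%, higher than either observation alone.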

Ø We have a set of numeric values for an object
Ø We want to use these values to predict a new value
Ø Examples:
Ø Estimating house prices
Ø Predicting the rating of an object
Ø ...
MACHINE LEARNING – LINEAR REGRESSION
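For a single input value the least-squares line has a closed-form solution. The sketch below fits it and uses it for the house-price example; the sizes and prices are invented and the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size in square metres and price in thousands.
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 200.0, 250.0, 300.0]
slope, intercept = fit_line(sizes, prices)
estimate = slope * 100.0 + intercept  # predicted price of a 100 m2 house
```

With several input values per object the same idea applies, but the solution is usually computed with matrix operations instead of these two sums.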
