Data Analysis, Mobility,
Proximity and App-based
Marketing
Deep Learning and data analyses
A new perspective
on how data support companies
on strategic decisions.
Presenter
Alberto Paro
Date 09/01/19
Ø Master's Degree in Computer Science Engineering at Politecnico di Milano
Ø Author of 3 books about ElasticSearch, covering versions 1 to 5.x, plus 6 tech reviews
Ø Big Data Trainer, Developer and Consultant on Big Data technologies (Akka, Playframework, Apache Spark, Reactive Programming), NoSQL (Accumulo, HBase, Cassandra, ElasticSearch, Kafka and MongoDB) and Machine Learning applied to Big Data.
Ø Evangelist for the Scala and Scala.js languages
ABOUT ME – ALBERTO PARO
Ø Big Data Concepts
Ø Market position
Ø Build a Solution for Intelligence
TOPICS
BIG DATA CONCEPTS
BIG DATA 4V
The ‘Datafication’
Ø Activity
Ø Conversation
Ø Text
Ø Voice
Ø Social Media
Ø Browser logs
Ø Photos
Ø Videos
Ø IOT
Ø Etc.
Volume
Veracity
Variety
Velocity
Big Data Analysis:
Ø Text analytics
Ø Sentiment analysis
Ø Face recognition
Ø Voice analytics
Ø Movement analytics
Ø Etc.
Value
TRANSFORM BIG DATA INTO VALUE
BIG DATA STORYMAP
ARTIFICIAL INTELLIGENCE/MACHINE LEARNING
ARTIFICIAL INTELLIGENCE: AI
The ability of a machine to replicate intelligent human behavior
MACHINE LEARNING
The ability to improve performance of a task progressively, without being explicitly programmed
4 BIG IDEAS
Data Driven Decision Making
Cloud Computing
Machine Learning
Cognitive Computing: ML + BigData + NLP
OUTLOOK
Worldwide spending on Cognitive and Artificial Intelligence Systems reached about $19.1 billion in 2018.
Source: IDC
40% of Digital Transformation initiatives will use AI services.
AI spending will grow to $42.2 billion in 2021.
Source: IDC
MACHINE LEARNING TYPES
MACHINE LEARNING LANDSCAPE
MACHINE LEARNING ANALYTICS
AI/ML MARKET
AI SPENDING
Retail $3.4B
Banking $3.3B
Manufacturing $2B
Healthcare $1.7B
Ø Customer Recommendation
Ø Customer Profiling
Ø Customer Pre-Selling
Ø Customer Post-Selling
Ø Fraud Detection
Ø Prediction Systems for Brokers (banking/finance)
AI TECHNOLOGIES – RETAIL AND BANKING
Ø Cost reduction via robots
Ø Creation of new products
Ø Quality monitoring
Ø Learning by example
Ø Predictive Maintenance
AI TECHNOLOGIES – MANUFACTURING
WHERE ARE COMPANIES SPENDING?
CLOUD
WHERE ARE COMPANIES SPENDING?
Data Science Teams
WHERE ARE COMPANIES SPENDING?
ML Tools
WHERE ARE COMPANIES SPENDING?
Deep Learning
Microservices Mesh
WHERE ARE COMPANIES SPENDING?
Lots of Proof-of-Concepts
TRACTION
Early Adopters:
Ø RPA (e-discovery, QA)
Ø Customer Service (chatbots)
Ø Marketing (sales lead automation)
Ø Behaviour Design (captology)
Majority:
Ø SPAM filtering
Ø Business Analytics
Ø Risk Scoring (insurance, banking, credit card)
TRACTION
Source: Grand View Research
INNOVATION
HR Analytics
INNOVATION
Sales (people) Automation
INNOVATION
NLP in (News) Media
INNOVATION
Automate Voice Customer Service
INNOVATION
AI in Healthcare
TEAM
Cognitive computers are:
Ø Made with algorithms
Ø Knowledgeable ONLY about what they are taught
Ø In control ONLY of what we give them control of
Ø Aware of nuances and able to continue to learn more
Cognitive algorithms can:
Ø Do very boring work for you
Ø Often make better, more consistent decisions than humans
Ø Be efficient and never get tired
DATA SCIENTIST TEAM
Machine Learning is mathematics / statistics
Ø Linear algebra
Ø Calculus
Ø Probability theory
Ø Graph theory
Ø …
Hardly anyone knows all of this.
It's a big field with lots of theory.
It has two orthogonal aspects:
Ø Analytics / machine learning
Ø Big data
Ø They can be combined or used separately
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
DATA SCIENTIST TEAM – SKILLS 1/3
DATA SCIENTIST TEAM – SKILLS 2/3
DATA SCIENTIST TEAM – SKILLS 3/3
APPLICATION EVOLUTION
Ø Applications in the year 2000
Ø Dozens of servers
Ø Response times on the order of seconds
Ø Hours of maintenance downtime
Ø Data on the order of gigabytes
Ø Accessed by desktop devices
Ø Modern applications (year >= 2010)
Ø Clusters of thousands of multicore processors
Ø Response times on the order of milliseconds
Ø 100% uptime
Ø Data on the order of petabytes
Ø Accessed by any device
APPLICATION EVOLUTION
Ø Reactive applications
Ø Event-driven
Ø Scalable
Ø Resilient/Elastic
Ø Responsive
Ø React to
Ø Events: the event-driven nature enables the other qualities
Ø Failure: resilient systems let you recover from errors at all levels
Ø Load: scalability does not depend on shared resources
Ø Users: response times should not depend on workload
APPLICATION EVOLUTION: NEW REQUIREMENTS
Ø No redesign needed to achieve scalability
Ø Scalability on demand
Ø Risk management
Ø Real-time, engaging, rich, collaborative
Ø No latency in responses
Ø Loosely coupled design
Ø Communication orientation
Ø Efficient use of resources
Ø Downtime is a waste of money
Ø Part of the design
REACTIVE MANIFESTO
APPLICATION/INFRASTRUCTURE EVOLUTION
ARCHITECTURES
Ø The flow of data is always the same.
Ø Each BigData project has the following steps
DATA FLOW
DATA – BATCH VS STREAM
BIG DATA - MACRO ARCHITECTURE
BIG DATA – MICRO STANDARD ARCHITECTURE
Ø A single master data store that contains all the knowledge
Ø Each application receives data from the main data lake, but is then independent for the analysis
Ø Allows you to crop / anonymize / mask data for the data scientists
BIG DATA – MULTI TENANT/INTELLIGENCE ARCHITECTURE
MACHINE LEARNING ARCHITECTURE FLOW
NOSQL
Ø Any database that is not a "relational database"
Ø The term was coined during a meet-up
Ø “Non-relational Databases”
Ø Not Only SQL
NOSQL
NoSQL
Key-Value
Ø Redis
Ø Voldemort
Ø Dynomite
Ø Tokyo*
BigTable Clones
Ø Accumulo
Ø HBase
Ø Cassandra
Document
Ø CouchDB
Ø MongoDB
Ø ElasticSearch
GraphDB
Ø Neo4j
Ø OrientDB
Ø …Graph
Message Queue
Ø Kafka
Ø RabbitMQ
Ø ...MQ
500K ops/s per node
15K ops/s per node
Petabytes of data
300K ev/s per node
Complex Data
NOSQL
NOSQL
Most common tools:
Ø Apache Kafka
Ø RabbitMQ
Ø Apache ActiveMQ
Ø Redis
Ø Producers publish messages to "topics / queues"
Ø They serve the messages to the consumers
Ø Essential for back-pressure
Ø They have very little functionality: no queries
NOSQL – MESSAGE QUEUE 1/2
NOSQL – MESSAGE QUEUE 2/2
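The role of a queue in back-pressure can be sketched with an in-process bounded buffer. This is only an analogy using Python's standard library, not the API of Kafka, RabbitMQ or any other broker; the names `run_pipeline`, `topic` and the capacity of 5 are assumptions made for illustration.

```python
import queue
import threading

def run_pipeline(n_messages=20, capacity=5):
    # Bounded buffer standing in for a topic: put() blocks when it is full,
    # which slows the producer down to the consumer's pace (back-pressure).
    topic = queue.Queue(maxsize=capacity)
    consumed = []

    def producer():
        for i in range(n_messages):
            topic.put(f"event-{i}")  # blocks while the consumer lags behind
        topic.put(None)  # sentinel: no more messages

    def consumer():
        while True:
            message = topic.get()
            if message is None:
                break
            consumed.append(message)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed

messages = run_pipeline()
```

The queue offers only put and get, which mirrors the point above that message queues have very little functionality and no queries.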
Ø Initially developed by LinkedIn and made open source in 2011
Ø Apache project since October 23, 2012
Ø In 2014 Confluent was founded by former LinkedIn developers to provide commercial support
Ø Widely used in enterprise-level projects / infrastructure.
Ø Performance scales linearly with the number of nodes
NOSQL – APACHE KAFKA
Ø Designed to store very large data sets (several petabytes)
Ø The high end of the market is dominated by Apache HBase and Accumulo
Ø Insert throughput depends only on the number of nodes
Ø They offer functionality to extend them
NOSQL – COLUMNAR DATABASES 1/3
Ø 115M entries/s
Ø 216 Nodes
Ø 1296 Ingestion processes
NOSQL – COLUMNAR DATABASES 2/3
NOSQL – COLUMNAR DATABASES 3/3
Ø Full-text search engine
Ø Based on Lucene, written in Java 8
Ø “Distributed, (Near) Real Time, Search Engine”
Ø RESTful JSON over HTTP, easy to debug
Ø Schema-free
Ø Dynamic mapping
Ø Multi-tenant
Ø Scalable
Ø From 1 node to thousands of nodes
Ø Highly available
Ø Rich set of search functions
Ø Built-in analytics
Ø Open source (Apache 2.0)
Ø Originally written by Shay Banon (Kimchy)
Ø Easy to install
ELASTICSEARCH
Ø Near real-time analytics in milliseconds
Ø Advanced analytics
Ø Your company's own “Google” engine
Ø New approach to the business
Ø Short time from data gathering to results
Ø A few low-cost servers can process in milliseconds as much data as a big Hadoop cluster or a very expensive DBMS solution
WHY ELASTICSEARCH?
ELASTICSEARCH CLIENTS
MACHINE LEARNING ALGORITHM
Ø Clustering
Ø Association learning
Ø Parameter estimation
Ø Recommendation engines
Ø Classification
Ø Similarity matching
Ø Neural networks
Ø Bayesian networks
Ø Genetic algorithms
MACHINE LEARNING - ALGORITHM
Ø Traditional databases can also handle Big Data.
Ø NoSQL databases have poor analytics (except Elasticsearch)
Ø MapReduce often works on text files
Ø It can also work on data from SQL and NoSQL
Ø NoSQL allows greater throughput
Ø In general, you may have a mix of sources
Ø Text files, NoSQL and SQL
MACHINE LEARNING – NOSQL AND BIGDATA
Ø One of the biggest problems
Ø Manually entered data is "suspicious"
Ø Many datasets are profoundly problematic
Ø Sometimes recovering data is problematic:
Ø Systematic problems with sensors
Ø Errors that cause data loss
Ø Incorrect metadata on sensors
Ø Never, ever, believe the data without checking!
Ø Garbage in, garbage out, etc. => SIZE
MACHINE LEARNING – DATA QUALITY
Ø Supervised
Ø We have a training dataset with the correct answers
Ø We use the training data to instruct the algorithm
Ø Then we apply it to data without answers
Ø Unsupervised
Ø There is no training data
Ø The data is fed into the algorithm hoping that it makes sense of the data.
Ø And that the data scientist can interpret the results
MACHINE LEARNING – TYPES
Ø Predictive
Ø They predict a variable from the data
Ø Classification
Ø They assign records to predefined groups
Ø Clustering
Ø Divides records into similarity-based groups
Ø Association Learning
Ø Evaluates relationships between records: "What Happens With What"
MACHINE LEARNING – TYPES
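As an illustration of clustering without training data, here is a minimal sketch of k-means on one-dimensional values. The function name `kmeans_1d`, the sample values and the starting centroids are assumptions chosen for the example; real data would normally be multi-dimensional.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data with two obvious groups.
centroids, clusters = kmeans_1d([1.0, 1.5, 2.0, 9.0, 10.0, 11.0], centroids=[0.0, 5.0])
```

No labels are supplied: the groups emerge from similarity alone, and interpreting them is left to the data scientist.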
Ø There is noise in the data
Ø Input data is inaccurate
Ø There are hidden / latent values
Ø Inductive bias
Ø Essentially the shape of the algorithm we choose
Ø Not all data can "fit"
Ø Introducing underfitting or overfitting
Ø Machine Learning without bias is not possible.
MACHINE LEARNING – PROBLEMS
Ø Testing is essential
Ø Testing means splitting data into 2 datasets:
Ø Training data (input for algorithms)
Ø Test data (used for evaluation)
Ø Performance measures have to be calculated
Ø Precision / Recall
Ø Mean Squared Error
MACHINE LEARNING – PROBLEMS
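The testing steps above can be sketched in a few lines. This is a minimal illustration using only the standard library; the 25% test fraction, the fixed seed and the toy labels are assumptions, not values from the slides.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and split it into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

train, test = train_test_split(range(100))
precision, recall = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

Precision and recall apply to classification, while the mean squared error applies to numeric predictions.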
1. C4.5
2. k-means clustering
3. Support vector machines
4. The Apriori algorithm
5. The EM algorithm
6. PageRank
7. AdaBoost
8. k-nearest neighbors classification
9. Naïve Bayes
10. CART
MACHINE LEARNING – TOP 10 ALGORITHMS
Ø Algorithm to build decision trees
Ø Essentially a tree of Boolean expressions
Ø Each node divides the data into 2
Ø Leaves associate objects with classes
Ø Decision trees are not only useful for classification
Ø They also teach us a lot about the classes
Ø C4.5 is a bit complex to learn
Ø The ID3 algorithm is much simpler
Ø CART (#10) is another algorithm for learning decision trees
MACHINE LEARNING – C4.5
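How a decision tree learner picks the attribute to split on can be sketched with the information gain used by the simpler ID3 algorithm (C4.5 refines this criterion with the gain ratio, which is not shown here). The toy records and attribute names are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training records.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
```

In this toy data `outlook` separates the classes perfectly while `windy` tells us nothing, which is also the kind of insight about the classes that a tree gives beyond classification.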
Ø It is a way to perform binary classification on data represented as matrices
Ø Support vectors are the data points closest to the hyperplane that divides the classes
Ø SVM maximizes the margin, that is the distance between the support vectors (SV) and the separating hyperplane.
MACHINE LEARNING – SUPPORT VECTOR MACHINES
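The geometry can be illustrated without training anything. The sketch below takes a hyperplane as given (an assumption: a real SVM would learn `w` and `b` by maximizing the margin) and finds the margin and the points that act as support vectors. The points and the line x + y - 5 = 0 are made up for the example.

```python
from math import hypot

def signed_distance(point, w, b):
    """Signed distance from a 2-D point to the hyperplane w·x + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / hypot(w[0], w[1])

def margin_and_support_vectors(points, w, b, tol=1e-9):
    """Margin = smallest distance to the hyperplane; the support vectors
    are the points that lie at exactly that distance."""
    distances = [abs(signed_distance(p, w, b)) for p in points]
    margin = min(distances)
    support = [p for p, d in zip(points, distances) if d - margin <= tol]
    return margin, support

# Hypothetical points on both sides of the line x + y - 5 = 0.
points = [(0, 0), (1, 1), (2, 1), (3, 4), (4, 4), (5, 5)]
margin, support = margin_and_support_vectors(points, w=(1, 1), b=-5)
```

Only the closest points determine the margin; moving any of the other points further away would not change the result.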
Ø It is an algorithm for "frequent itemsets"
Ø Essentially it extracts which items appear frequently together
Ø For example, what products are bought together in the supermarket?
Ø Used by Amazon: "Customers who bought this also bought ..."
Ø It can also be used to create association rules
Ø Apriori is slow
MACHINE LEARNING – APRIORI ALGORITHM
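A compact sketch of the level-wise Apriori search follows. It is a minimal illustration, not an optimized implementation; the baskets and the minimum support of 3 are assumptions for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support baskets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: build larger candidates only from frequent itemsets.
        merged = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        candidates = [c for c in merged
                      if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
frequent = apriori(baskets, min_support=3)
```

Each level rescans all baskets to count the candidates, which is one reason the algorithm is slow on large datasets.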
Ø Used in various contexts
Ø Very difficult to understand what it does
Ø Mathematically very heavy
Ø It is an iterative algorithm
Ø It alternates between an "expectation" step and a "maximization" step
Ø It tries to optimize the output of a function
Ø It can be used for clustering
MACHINE LEARNING – EXPECTATION MAXIMIZATION
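The iteration is easier to follow on a deliberately simplified case. The sketch below assumes a mixture of two one-dimensional Gaussians with unit variance and equal weights, so that only the two means are estimated; the data and the starting means are made up for the example.

```python
from math import exp

def em_two_means(data, mu_a, mu_b, iterations=20):
    """EM for two 1-D Gaussians with unit variance and equal weights."""
    for _ in range(iterations):
        # E-step: responsibility of component A for each point.
        resp_a = []
        for x in data:
            like_a = exp(-0.5 * (x - mu_a) ** 2)
            like_b = exp(-0.5 * (x - mu_b) ** 2)
            resp_a.append(like_a / (like_a + like_b))
        # M-step: each mean becomes the responsibility-weighted average.
        weight_a = sum(resp_a)
        weight_b = len(data) - weight_a
        mu_a = sum(r * x for r, x in zip(resp_a, data)) / weight_a
        mu_b = sum((1 - r) * x for r, x in zip(resp_a, data)) / weight_b
    return mu_a, mu_b

# Hypothetical data drawn around 1 and around 5.
mu_a, mu_b = em_two_means([0.8, 1.0, 1.2, 4.8, 5.0, 5.2], mu_a=0.0, mu_b=6.0)
```

The responsibilities act as soft cluster assignments, which is why EM can be used for clustering.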
Ø It is a graph algorithm
Ø Determines the most important nodes
Ø It is used by Google to weight the results
Ø It can be applied to all graphs
Ø For example, RDF data
Ø It works by simulating random walks
Ø Estimating the probability of being at a given node at a given time
Ø Implemented with linear algebra
MACHINE LEARNING – PAGERANK
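A minimal sketch of PageRank by power iteration, written with plain dictionaries instead of matrices. The damping factor 0.85 is the commonly used value; the four-node graph is an assumption for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it points to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random jump: every node receives a small baseline share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

Each pass is one step of the random walk; the ranks converge to the long-run probability of standing on each node.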
Ø Algorithm for "ensemble learning"
Ø Combines several algorithms
Ø Runs them on the same data
Ø The combination of multiple algorithms can be very effective
Ø Better than just one algorithm
Ø AdaBoost essentially weights the training samples
Ø Giving more weight to those that are classified incorrectly
MACHINE LEARNING – ADABOOST
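The re-weighting step can be sketched for a single boosting round. The weak learner itself is left out: the sketch only assumes we know which samples it classified correctly, and the four equally weighted samples are made up for the example.

```python
from math import exp, log

def adaboost_round(weights, correct):
    """One AdaBoost round: from the sample weights and a per-sample flag
    saying whether the weak learner was right, return (alpha, new_weights)."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1.0 - error) / error)  # assumes 0 < error < 1
    # Shrink the weight of correct samples, grow the weight of mistakes.
    scaled = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
alpha, weights = adaboost_round(weights, correct=[True, True, True, False])
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right; `alpha` is the vote that this round's learner receives in the final ensemble.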
Ø Given a group of items
Ø Movies, books, ...
Ø You have user ratings
Ø 1-5 stars, 1-10, ...
Ø It can be used to recommend items to a user depending on other people's scores
Ø For this reason it is called collaborative filtering
MACHINE LEARNING – COLLABORATIVE FILTERING
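One simple variant is user-based filtering: estimate a missing rating as the similarity-weighted average of other users' ratings. The sketch below uses cosine similarity over co-rated items; the users, movies and scores are assumptions for the example, and production systems usually normalize ratings first.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted, total = 0.0, 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        weighted += sim * their[item]
        total += sim
    return weighted / total if total else None

# Hypothetical 1-5 star ratings.
ratings = {
    "ann": {"Alien": 5, "Heat": 4},
    "bob": {"Alien": 5, "Heat": 4, "Up": 2},
    "cat": {"Alien": 1, "Heat": 2, "Up": 5},
}
estimate = predict_rating(ratings, "ann", "Up")
```

Here `bob` rates exactly like `ann` on the shared movies, so his low score for "Up" pulls the estimate below `cat`'s rating of 5.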
Ø A theorem that combines probabilities
Ø I observed A, which says that H is true with probability 70%
Ø I observed B, which says that H is true with probability 85%
Ø What can I conclude?
Ø Bayes' theorem
Ø With the assumption that A and B are independent
Ø This assumption is almost always false, hence "naive"
MACHINE LEARNING – NAÏVE BAYES
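The 70% / 85% example can be worked through in code. The sketch assumes a prior of 50% for H, which the slide does not state, and treats the two observations as conditionally independent; under those assumptions each observation multiplies the odds of H by its own likelihood ratio.

```python
def combine_independent(probabilities, prior=0.5):
    """Combine several P(H | evidence_i) values into one posterior,
    assuming the pieces of evidence are conditionally independent
    (the "naive" assumption) and share the same prior for H."""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in probabilities:
        # Likelihood ratio of this evidence = its posterior odds / prior odds.
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)

posterior = combine_independent([0.70, 0.85])
```

With these assumptions the two observations together give about 93%, higher than either one alone, because independent evidence pointing the same way reinforces itself.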
Ø We have a set of numeric values for an object
Ø We want to use these values to predict a new value
Ø Examples:
Ø Estimating house costs
Ø Predicting the rating of an item
Ø ...
MACHINE LEARNING – LINEAR REGRESSION
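For a single input variable, the fit can be sketched with the closed-form least squares solution. The house areas and prices below are made-up numbers for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical data: floor area in square metres vs. price in thousands.
areas = [50, 70, 90, 110]
prices = [150, 200, 250, 300]
slope, intercept = fit_line(areas, prices)
predicted = slope * 100 + intercept  # estimate for a 100 square metre house
```

With several input values per object the same idea applies, but the solution is usually computed with linear algebra rather than by hand.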
Thank you

More Related Content

What's hot

Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
Ajay Ohri
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
Ahmed Kamal
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
Domino Data Lab
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science Lifecycle
SwapnilDahake2
 
1. introduction to data science —
1. introduction to data science —1. introduction to data science —
1. introduction to data science —
swethaT16
 
H2O World - Machine Learning for non-data scientists
H2O World - Machine Learning for non-data scientistsH2O World - Machine Learning for non-data scientists
H2O World - Machine Learning for non-data scientists
Sri Ambati
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
ANOOP V S
 
Tales from an ip worker in consulting and software
Tales from an ip worker in consulting and softwareTales from an ip worker in consulting and software
Tales from an ip worker in consulting and software
Greg Makowski
 
Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...
Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...
Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...
Thoughtworks
 
Data science
Data scienceData science
Data science
GitanshuSharma1
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
Travis Oliphant
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
Sri Ambati
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
Alexander Bauer
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
Giuseppe Manco
 
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceGeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
Mark West
 
Data science 101
Data science 101Data science 101
Data science 101
University of West Florida
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
ShilpaKrishna6
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Sampath Kumar
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Edureka!
 

What's hot (20)

Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science Lifecycle
 
1. introduction to data science —
1. introduction to data science —1. introduction to data science —
1. introduction to data science —
 
H2O World - Machine Learning for non-data scientists
H2O World - Machine Learning for non-data scientistsH2O World - Machine Learning for non-data scientists
H2O World - Machine Learning for non-data scientists
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Tales from an ip worker in consulting and software
Tales from an ip worker in consulting and softwareTales from an ip worker in consulting and software
Tales from an ip worker in consulting and software
 
Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...
Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...
Big Data Agile Analytics by Ken Collier - Director Agile Analytics, Thoughtwo...
 
Data science
Data scienceData science
Data science
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceGeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
 
Data science 101
Data science 101Data science 101
Data science 101
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
 

Similar to LUISS - Deep Learning and data analyses - 09/01/19

Big Data in small words
Big Data in small wordsBig Data in small words
Big Data in small words
Yogesh Tomar
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Databricks
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale Overview
Pete Jarvis
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
Denodo
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
javed75
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
Joaquin Vanschoren
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Ali Alkan
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Ghulam Imaduddin
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Ali Alkan
 
Bigdataanalytics
BigdataanalyticsBigdataanalytics
Bigdataanalytics
Haroon Karim
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Krishna Sankar
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AI
Amazon Web Services
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera, Inc.
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Osman Ali
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
Tools for Unstructured Data Analytics
Tools for Unstructured Data AnalyticsTools for Unstructured Data Analytics
Tools for Unstructured Data AnalyticsRavi Teja
 
What is business analytics
What is business analyticsWhat is business analytics
What is business analytics
Sherpa Consulting
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
Denodo
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
Kirk Haslbeck
 

Similar to LUISS - Deep Learning and data analyses - 09/01/19 (20)

Big Data in small words
Big Data in small wordsBig Data in small words
Big Data in small words
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale Overview
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
 
Bigdataanalytics
BigdataanalyticsBigdataanalytics
Bigdataanalytics
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AI
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 
Tools for Unstructured Data Analytics
Tools for Unstructured Data AnalyticsTools for Unstructured Data Analytics
Tools for Unstructured Data Analytics
 
What is business analytics
What is business analyticsWhat is business analytics
What is business analytics
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
 

More from Alberto Paro

Data streaming
Data streamingData streaming
Data streaming
Alberto Paro
 
2018 07-11 - kafka integration patterns
2018 07-11 - kafka integration patterns2018 07-11 - kafka integration patterns
2018 07-11 - kafka integration patterns
Alberto Paro
 
Elasticsearch in architetture Big Data - EsInADay-2017
Elasticsearch in architetture Big Data - EsInADay-2017Elasticsearch in architetture Big Data - EsInADay-2017
Elasticsearch in architetture Big Data - EsInADay-2017
Alberto Paro
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator
Alberto Paro
 
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
Alberto Paro
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator
Alberto Paro
 
2016 02-24 - Piattaforme per i Big Data
2016 02-24 - Piattaforme per i Big Data2016 02-24 - Piattaforme per i Big Data
2016 02-24 - Piattaforme per i Big Data
Alberto Paro
 
What's Big Data? - Big Data Tech - 2015 - Firenze
What's Big Data? - Big Data Tech - 2015 - FirenzeWhat's Big Data? - Big Data Tech - 2015 - Firenze
What's Big Data? - Big Data Tech - 2015 - Firenze
Alberto Paro
 
ElasticSearch Meetup 30 - 10 - 2014
ElasticSearch Meetup 30 - 10 - 2014ElasticSearch Meetup 30 - 10 - 2014
ElasticSearch Meetup 30 - 10 - 2014
Alberto Paro
 
Scala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJSScala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJS
Alberto Paro
 

More from Alberto Paro (10)

Data streaming
Data streamingData streaming
Data streaming
 
2018 07-11 - kafka integration patterns
2018 07-11 - kafka integration patterns2018 07-11 - kafka integration patterns
2018 07-11 - kafka integration patterns
 
Elasticsearch in architetture Big Data - EsInADay-2017
Elasticsearch in architetture Big Data - EsInADay-2017Elasticsearch in architetture Big Data - EsInADay-2017
Elasticsearch in architetture Big Data - EsInADay-2017
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator
 
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator
 
2016 02-24 - Piattaforme per i Big Data
2016 02-24 - Piattaforme per i Big Data2016 02-24 - Piattaforme per i Big Data
2016 02-24 - Piattaforme per i Big Data
 
What's Big Data? - Big Data Tech - 2015 - Firenze
What's Big Data? - Big Data Tech - 2015 - FirenzeWhat's Big Data? - Big Data Tech - 2015 - Firenze
What's Big Data? - Big Data Tech - 2015 - Firenze
 
ElasticSearch Meetup 30 - 10 - 2014
ElasticSearch Meetup 30 - 10 - 2014ElasticSearch Meetup 30 - 10 - 2014
ElasticSearch Meetup 30 - 10 - 2014
 
Scala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJSScala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJS
 

LUISS - Deep Learning and data analyses - 09/01/19

  • 1. Data Analysis, Mobility, Proximity and App-based Marketing Deep Learning and data analyses A new perspective on how data support companies on strategic decisions. Presenter Alberto Paro Date 09/01/19
  • 2. Ø Master's degree in Computer Science Engineering at Politecnico di Milano Ø Author of 3 books about ElasticSearch from 1 to 5.x + 6 Tech reviews Ø Big Data Trainer, Developer and Consultant on Big Data Technologies (Akka, Playframework, Apache Spark, Reactive Programming), NoSQL (Accumulo, HBase, Cassandra, ElasticSearch, Kafka and MongoDB) and Machine Learning applied to Big Data. Ø Evangelist for the Scala and Scala.js languages ABOUT ME – ALBERTO PARO
  • 3. Ø Big Data Concepts Ø Market position Ø Build a Solution for Intelligence TOPICS
  • 6. The ‘Datafication’ Ø Activity Ø Conversation Ø Text Ø Voice Ø Social Media Ø Browser logs Ø Photos Ø Videos Ø IoT Ø Etc. Volume Veracity Variety Velocity Big Data Analysis: Ø Text analytics Ø Sentiment analysis Ø Face recognition Ø Voice analytics Ø Movement analytics Ø Etc. Value TRANSFORM BIG DATA INTO VALUE
  • 9. ARTIFICIAL INTELLIGENCE: AI The ability of a machine to replicate intelligent human behavior
  • 10. MACHINE LEARNING The ability to improve performance of a task progressively, without being explicitly programmed
  • 11. 4 BIG IDEAS Data Driven Decision Making Cloud Computing Machine Learning Cognitive Computing: ML + BigData + NLP
  • 12. OUTLOOK Worldwide Spending on Cognitive and Artificial Intelligence Systems reached about $19.1 Billion in 2018 Source: IDC 40% of Digital Transformation initiatives will use AI services. AI spending will grow to $42.2 Billion in 2021. Source: IDC
  • 17. AI SPENDING Retail $3.4B Banking $3.3B Manufacturing $2B Healthcare $1.7B
  • 18. Ø Customer Recommendation Ø Customer Profiling Ø Customer Pre-Selling Ø Customer Post-Selling Ø Fraud Detection Ø Prediction Systems for Brokers (banking/finance) AI TECHNOLOGIES – RETAIL AND BANKING
  • 19. Ø Cost reduction via robots Ø Creation of new products Ø Quality monitoring Ø Learning by example Ø Predictive Maintenance AI TECHNOLOGIES – MANUFACTURING
  • 20. WHERE ARE COMPANIES SPENDING? CLOUD
  • 21. WHERE ARE COMPANIES SPENDING? Data Science Teams
  • 22. WHERE ARE COMPANIES SPENDING? ML Tools
  • 23. WHERE ARE COMPANIES SPENDING? Deep Learning Microservices Mesh
  • 24. WHERE ARE COMPANIES SPENDING? Lots of Proofs of Concept
  • 25. TRACTION Early Adopters: Ø RPA (e-discovery, QA) Ø Customer Service (chatbots) Ø Marketing (sales lead automation) Ø Behaviour Design (captology) Majority: Ø SPAM filtering Ø Business Analytics Ø Risk Scoring (insurance, banking, credit card)
  • 32. TEAM
  • 33. Cognitive computers are: Ø Made with algorithms Ø Knowledgeable ONLY about what they are taught Ø In control ONLY of what we give them control of Ø Aware of nuances and able to continue learning Cognitive algorithms: Ø Do very boring work for you Ø Often make better, more consistent decisions than humans Ø Are efficient and won't get tired DATA SCIENTIST TEAM
  • 34. Machine Learning is mathematics / statistics Ø Linear algebra Ø Calculus Ø Probability theory Ø Graph theory Ø … Hardly anyone knows all of this. It's a big field with lots of theory. It has two orthogonal aspects Ø Analytics / machine learning Ø Big data Ø They can be combined or used separately http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram DATA SCIENTIST TEAM – SKILLS 1/3
  • 35. DATA SCIENTIST TEAM – SKILLS 2/3
  • 36. DATA SCIENTIST TEAM – SKILLS 3/3
  • 38. Ø Applications in the year 2000 Ø Dozens of servers Ø Response times on the order of seconds Ø Hours of maintenance downtime Ø Data on the order of gigabytes Ø Accessed by desktop devices Ø Applications from 2010 onward (modern ones) Ø Clusters of thousands of multicore processors Ø Response times on the order of milliseconds Ø 100% uptime Ø Data on the order of petabytes Ø Accessed by any device APPLICATION EVOLUTION
  • 39. Ø Reactive application Ø Object-oriented Ø Scalable Ø Resilient/Elastic Ø Responsive Ø React to Ø Events: the event-driven nature enables the other qualities Ø Failure: resilient systems allow you to recover from errors at all levels Ø Load: scalability does not depend on shared resources Ø Users: response times should not depend on workload APPLICATION EVOLUTION: NEW REQUIREMENTS
  • 40. Ø No redesign needed to achieve scalability Ø Scalability on demand Ø Risk management Ø Real-time, engaging, rich, collaborative Ø No latency in responses Ø Loosely coupled design Ø Communication orientation Ø Efficient use of resources Ø Downtime is a waste of money Ø Part of the design REACTIVE MANIFESTO
  • 43. Ø The flow of data is always the same. Ø Each BigData project has the following steps DATA FLOW
  • 44. DATA – BATCH VS STREAM
  • 45. BIG DATA - MACRO ARCHITECTURE
  • 46. BIG DATA – MICRO STANDARD ARCHITECTURE
  • 47. Ø A single master data store that contains all the knowledge Ø Each application receives data from the main data lake, but is then independent for the analysis Ø Allows you to crop / anonymize / mask data for the data scientists BIG DATA – MULTI TENANT/INTELLIGENCE ARCHITECTURE
  • 49. NOSQL
  • 50. Ø Any database that is not a "relational database" Ø The term was coined during a meet-up Ø "Non-relational Databases" Ø Not Only SQL NOSQL
  • 52. NOSQL
  • 53. Most common tools: Ø Apache Kafka Ø RabbitMQ Ø Apache ActiveMQ Ø Redis Ø Producers publish messages to "topics / queues" Ø They serve the messages to the consumers Ø Essential for back-pressure Ø They have very little functionality: no queries NOSQL – MESSAGE QUEUE 1/2
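
The producer / consumer / back-pressure idea on this slide can be sketched with nothing but the Python standard library. This is an in-memory illustration only, not how Kafka or RabbitMQ are implemented or used; the function name `run_pipeline` and its parameters are hypothetical. The bounded queue is what gives back-pressure: the producer blocks whenever the consumer falls behind.

```python
import queue
import threading


def run_pipeline(n_messages: int, capacity: int) -> list:
    """Push n_messages through a bounded queue and return what the consumer received."""
    topic = queue.Queue(maxsize=capacity)  # bounded buffer: the source of back-pressure
    received = []
    sentinel = object()  # marks the end of the stream

    def producer():
        for i in range(n_messages):
            topic.put(i)  # blocks while the queue is full
        topic.put(sentinel)

    def consumer():
        while True:
            message = topic.get()  # blocks while the queue is empty
            if message is sentinel:
                break
            received.append(message)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


messages = run_pipeline(n_messages=10, capacity=2)
print(messages)
```

Note that the queue offers only `put` and `get`: there is no way to query its contents, which mirrors the "very little functionality" point above.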
  • 54. NOSQL – MESSAGE QUEUE 2/2
  • 55. Ø Initially developed by LinkedIn and open-sourced in 2011 Ø Apache project since October 23, 2012 Ø In 2014 Confluent was founded by former LinkedIn developers to provide commercial support Ø Widely adopted in enterprise-level projects / infrastructures Ø Performance scales linearly with the number of nodes NOSQL – APACHE KAFKA
  • 56. Ø Designed to store very large data sets (several petabytes) Ø The high end of the market is dominated by Apache HBase and Accumulo Ø The insert rate depends only on the number of nodes Ø They offer functionality to extend them NOSQL – COLUMNAR DATABASES 1/3
  • 57. Ø 115M entries/s Ø 216 Nodes Ø 1296 Ingestion processes NOSQL – COLUMNAR DATABASES 2/3
  • 58. NOSQL – COLUMNAR DATABASES 3/3
  • 59. Ø Full-Text Search Engine Ø Based on Lucene, written in Java 8 Ø "Distributed, (Near) Real Time, Search Engine" Ø RESTful JSON over HTTP, easy to debug Ø Schema-free Ø Dynamic Mapping Ø Multi-tenant Ø Scalable Ø From 1 node to thousands of nodes Ø Highly available Ø Rich set of search functions Ø Built-in Analytics Ø Open Source (Apache 2.0) Ø Originally written by Shay Banon (Kimchy) Ø Easy to install ELASTICSEARCH
  • 60. Ø Near real-time analytics in milliseconds Ø Advanced Analytics Ø Your company's own "Google" engine Ø New approach to the Business Ø Short time from data gathering to results Ø A few low-cost servers can process more data in milliseconds than a big Hadoop cluster or a very expensive DBMS solution WHY ELASTICSEARCH?
  • 63. Ø Clustering Ø Association learning Ø Parameter estimation Ø Recommendation engines Ø Classification Ø Similarity matching Ø Neural networks Ø Bayesian networks Ø Genetic algorithms MACHINE LEARNING - ALGORITHM
  • 64. Ø Traditional databases can also handle Big Data Ø NoSQL databases have poor analytics (except Elasticsearch) Ø MapReduce often works on text files Ø It can also work on data from SQL and NoSQL Ø NoSQL allows greater throughput Ø In general, you may have a mix of sources Ø Text files, NoSQL and SQL MACHINE LEARNING – NOSQL AND BIGDATA
  • 65. Ø One of the biggest problems Ø Manually entered data is "suspicious" Ø Many datasets are profoundly problematic Ø Sometimes recovering data is problematic: Ø Systematic problems with sensors Ø Errors that cause data loss Ø Incorrect metadata on sensors Ø Never, ever, believe the data without checking! Ø Garbage in, garbage out, etc => SIZE MACHINE LEARNING – DATA QUALITY
  • 66. Ø Supervised Ø We have a training dataset with the correct answers Ø We use the training data to instruct the algorithm Ø Then we apply it to data without answers Ø Unsupervised Ø There is no training data Ø The data is fed into the algorithm in the hope that it makes sense of the data Ø The data scientist can then interpret the results MACHINE LEARNING – TYPES
  • 67. Ø Predictive Ø They predict a variable from the data Ø Classification Ø They assign records to predefined groups Ø Clustering Ø Divides records into similarity-based groups Ø Associative Learning Ø Evaluates relationships between records: "What Happens With What" MACHINE LEARNING – TYPES
  • 68. Ø There is noise in the data Ø Input data is inaccurate Ø There are hidden / latent values Ø Inductive bias Ø Essentially the shape of the algorithm we choose Ø Not all data can "fit" Ø Introducing underfitting or overfitting Ø Machine Learning without bias is not possible. MACHINE LEARNING – PROBLEMS
  • 69. Ø Testing is essential Ø Testing means splitting data into 2 datasets: Ø Training data (input for algorithms) Ø Test data (used for evaluation) Ø Performance measures have to be calculated Ø Precision / Recall Ø Mean squared error MACHINE LEARNING – PROBLEMS
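
The testing steps on this slide can be sketched in plain Python. The helper names (`train_test_split`, `precision_recall`, `mean_squared_error`), the 25% test fraction and the sample labels are all assumptions made for illustration; libraries such as scikit-learn ship their own versions of these helpers.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


train, test = train_test_split(list(range(8)))
precision, recall = precision_recall([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Precision and recall apply to classification outputs, while mean squared error applies to numeric predictions; which one matters depends on the task.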
  • 70. 1. C4.5 2. k-means clustering 3. Support vector machines 4. The Apriori algorithm 5. The EM algorithm 6. PageRank 7. AdaBoost 8. k-nearest neighbors classification 9. Naïve Bayes 10. CART MACHINE LEARNING – TOP 10 ALGORITHMS
  • 71. Ø Algorithm to build decision trees Ø Essentially a tree of Boolean expressions Ø Each node divides the data into 2 Ø Leaves associate objects with classes Ø Decision trees do not serve only for categorization Ø They also teach us a lot about the classes Ø C4.5 is a bit complex to learn Ø The ID3 algorithm is much simpler Ø CART (#10) is another algorithm for learning decision trees MACHINE LEARNING – C4.5
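
A decision-tree learner has to decide which attribute to test at each node. C4.5 scores candidate splits with the gain ratio (information gain divided by the split information), whereas ID3 uses plain information gain. The sketch below computes only that score for categorical attributes; it is not a full C4.5 implementation, and the toy `rows` and `labels` data are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, labels, attribute):
    """Gain ratio of splitting `rows` (a list of dicts) on one categorical attribute."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
best = max(["outlook", "windy"], key=lambda a: gain_ratio(rows, labels, a))
print(best)
```

In this toy data the class depends only on `outlook`, so that attribute gets the highest score and would become the root of the tree.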
  • 72. Ø It is a way to perform binary classification of data represented as matrices Ø Support vectors are the data points closest to the hyperplane dividing the classes Ø SVM maximizes the distance (margin) between the support vectors and the dividing hyperplane MACHINE LEARNING – SUPPORT VECTOR MACHINES
  • 73. Ø It is an algorithm for "frequent itemsets" Ø Essentially it extracts which items frequently appear together Ø For example, which products are bought together at the supermarket? Ø Used by Amazon: "Customers who bought this also bought ..." Ø It can also be used to create association rules Ø Apriori is slow MACHINE LEARNING – APRIORI ALGORITHM
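
A minimal sketch of the Apriori idea, assuming shopping baskets are given as Python sets and support is an absolute count. The function name `apriori` and the sample baskets are illustrative; the repeated scans over all transactions at each level are one reason the algorithm is slow on large data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support baskets."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets, then prune every candidate
        # that has an infrequent (k-1)-subset (the Apriori property).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(baskets, min_support=3)
```

With these baskets and a minimum support of 3, the only frequent pair is bread and milk; association rules would then be derived from the frequent itemsets.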
  • 74. Ø Used in various contexts Ø Very difficult to understand what it does Ø Mathematically very heavy Ø It is an iterative algorithm Ø Alternates between an "expectation" step and a "maximization" step Ø Tries to optimize the output of a function Ø It can be used for clustering MACHINE LEARNING – EXPECTATION MAXIMIZATION
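
To make the two alternating steps concrete, here is a small sketch of EM used for clustering: it fits a mixture of two one-dimensional Gaussians. The choice of two components, the initialization from the minimum and maximum of the data, the variance floor and the sample data are all assumptions made for this illustration.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_two_gaussians(data, iterations=20):
    """Fit a two-component 1-D Gaussian mixture; returns (means, variances, weights)."""
    means = [min(data), max(data)]  # crude initialization (an assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from those soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[k] = nk / len(data)
    return means, variances, weights


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_two_gaussians(data)
```

On this well-separated data the two means settle close to the centres of the two groups; the function being optimized is the likelihood of the data under the mixture.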
  • 75. Ø It is a graph algorithm Ø Determines the most important nodes Ø It is used by Google to weight the results Ø It can be applied to all graphs Ø For example, RDF data Ø It works by simulating random paths Ø Estimates the probability of visiting a given node at a given time Ø Implemented with linear algebra MACHINE LEARNING – PAGERANK
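
The random-path idea can be sketched as a power iteration over an adjacency list. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from nodes without outgoing links and the four-node sample graph are assumptions of this sketch, not details taken from the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each node to the list of nodes it links to; returns {node: rank}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the random surfer jumps to any node.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
most_important = max(ranks, key=ranks.get)
print(most_important)
```

Each iteration is equivalent to multiplying the rank vector by a transition matrix, which is why production implementations express it with linear algebra.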
  • 76. Ø Algorithm for "ensemble learning" Ø Combines several algorithms Ø Run on the same data Ø The combination of multiple algorithms can be very effective Ø Better than just one algorithm Ø AdaBoost essentially weights the training samples Ø Giving more weight to those that are classified worse MACHINE LEARNING – ADABOOST
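
The re-weighting of training samples can be sketched as a single boosting round. This follows the standard discrete AdaBoost update and assumes labels encoded as +1 / -1 and a weak learner whose weighted error is strictly between 0 and 1; the function name `adaboost_round` and the five-sample example are illustrative only.

```python
from math import exp, log


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: returns (learner weight alpha, updated sample weights)."""
    # Weighted error of the weak learner on the current sample weights.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * log((1.0 - error) / error)
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down.
    updated = [w * exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # only the last sample is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right; the final ensemble votes with each learner weighted by its `alpha`.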
  • 77. Ø Given a group of items Ø Movies, books, ... Ø You have user ratings Ø 1-5 stars, 1-10, ... Ø It can be used to recommend items to a user depending on other people's scores Ø For this reason it is called collaborative filtering MACHINE LEARNING – COLLABORATIVE FILTERING
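
One simple way to turn other people's scores into a recommendation is user-based filtering: weight each other user's rating by how similar that user is to the target user. The sketch below assumes cosine similarity over co-rated items and positive ratings; the users, movies and scores are made up, and real systems use more robust variants.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts, computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else None


ratings = {
    "ann": {"matrix": 5, "titanic": 1},
    "bob": {"matrix": 5, "titanic": 1, "alien": 5},
    "cat": {"matrix": 1, "titanic": 5, "alien": 1},
}
prediction = predict_rating(ratings, "ann", "alien")
```

Because ann's tastes match bob's far more than cat's, the predicted rating for "alien" lands closer to bob's 5 than to cat's 1.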
  • 78. Ø A theorem that combines probabilities Ø I observe A, which says that H is true with probability 70% Ø I observe B, which says that H is true with probability 85% Ø What can I conclude? Ø Bayes' theorem Ø With the assumption that A and B are independent Ø This assumption is almost always false, hence "naive" MACHINE LEARNING – NAÏVE BAYES
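
The 70% / 85% question can be worked through numerically. Under the naive assumption that A and B are conditionally independent given H, each observation multiplies the odds of H by its own likelihood ratio. The slide does not state a prior, so the sketch assumes P(H) = 0.5; the function name is hypothetical.

```python
def combine_independent_evidence(posteriors, prior=0.5):
    """Combine P(H | evidence_i) values, assuming independent evidence and a shared prior."""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in posteriors:
        # Likelihood ratio contributed by this observation alone.
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


combined = combine_independent_evidence([0.70, 0.85])
print(round(combined, 4))
```

With a 50% prior the combined odds are (0.7 / 0.3) × (0.85 / 0.15) = 119 / 9, which gives a probability of 119 / 128, about 93%: two agreeing observations are stronger than either one alone. A different prior would change the result.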
  • 79. Ø We have a set of numeric values for an object Ø We want to use these values to predict a new value Ø Examples: Ø Estimating house costs Ø Predicting the rating of an object Ø ... MACHINE LEARNING – LINEAR REGRESSION
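
For the house-cost example, the simplest case is a straight line fitted by ordinary least squares on a single feature. The sizes and prices below are invented for illustration (price is exactly 3 × size), and the function name `fit_line` is an assumption; with several features one would use a library such as NumPy or scikit-learn instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


sizes_m2 = [50.0, 80.0, 100.0, 120.0]          # made-up house sizes
prices_k = [150.0, 240.0, 300.0, 360.0]        # made-up prices in thousands
slope, intercept = fit_line(sizes_m2, prices_k)
predicted_price = slope * 90.0 + intercept     # estimate for an unseen 90 m2 house
```

The fitted line can then be evaluated on held-out data with the mean squared error to check how well it generalizes.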