Slide deck from our seminar about Data Science (30/09/2014)
Topics covered:
- What is Data Science?
- What can Data Science do for your business?
- How does Data Science relate to Statistics, BI and BigData?
- Practical applications of data mining techniques: decision trees, Naive Bayes, k-means clustering, Apriori
- Real-world case of applied data science
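As an illustration of one of the techniques listed above, k-means clustering can be sketched in a few lines of plain Python. This is a minimal sketch, not a production implementation, and the 2-D data points are invented for the example:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its members.
    Naive initialization: the first k points serve as starting centroids."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups of 2-D points (hypothetical data)
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
        (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centroids, clusters = kmeans(data, k=2)
```

With this data the algorithm recovers the two obvious groups of three points each; real uses would add smarter initialization (e.g. k-means++) and a convergence check.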
Harvesting Business Value with Data Science - InfoFarm
Slide deck from our seminar on "Harvesting Business Value with Data Science" (18/03/2015)
Topics covered:
- What is Data Science?
- Data Science: Tools and Techniques
- Data Science examples:
- Market segmentation
- Impact analysis
- Recommendations
- Water treatment
- Damage type research
- Call center aid
- Personalized client mailing (Essent)
- What do people write about us
- Fraud detection: Gotch’All (KU Leuven)
This document outlines an agenda for a seminar on data science applications for e-commerce. It discusses how data science can be used to improve recommendations, analyze the relationship between physical and online sales, enable dynamic pricing and personalized offerings, gather and use external data, optimize order fulfillment through anticipatory shipping, and improve customer service. Specific examples are provided for how data mining techniques can be applied to transaction data, web logs, product data and other sources to gain insights.
Retail Detail OmniChannel Congress 2015 - Data Science for e-commerce - InfoFarm
This document discusses how data science can be applied to e-commerce. It begins with an agenda that outlines discussing data science and big data in e-commerce, examples of data science applications, and how to apply data science. Examples of data science applications discussed include recommendations, personalized offerings, anticipatory shipping, and customer service optimizations. The document provides details on each application and emphasizes using data to gain insights about customers rather than making assumptions. It also stresses testing hypotheses with A/B testing before full rollout.
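The point about A/B testing hypotheses before full rollout can be made concrete with a two-proportion z-test, one common way to check whether an observed difference in conversion rates is statistically significant. The conversion numbers below are invented for illustration:

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is variant B's conversion rate
    significantly different from variant A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 5000 visitors per variant
z, p = ab_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
```

Here a lift from 4.0% to 5.2% yields a p-value well under 0.05, so the variant would pass a conventional significance threshold before rollout.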
The document is an agenda for a seminar on machine learning techniques and tools. It will cover an introduction to machine learning, common techniques like classification, clustering and regression. It will also discuss tools for machine learning like Apache Mahout, Weka, Spark MLLib and R. Finally, it will include a hands-on demonstration of machine learning algorithms and discuss benefits of using machine learning.
This document provides an introduction to data science. It discusses why data science is important and covers key techniques like statistics, data mining, and visualization. It also reviews popular tools and platforms for data science like R, Hadoop, and real-time systems. Finally, it discusses how data science can be applied across different business domains such as financial services, telecom, retail, and healthcare.
Jump start into 2013 by exploring how Big Data can transform your business. Listen to Infochimps Director of Product, Tim Gasper, cover the leading use cases for 2013, sharing where the data comes from, how the systems are architected and most importantly, how they drive business insights for data-driven decisions.
About
Evolution of data; data science, business analytics, and applications; AI, ML, DL, and their relationship to data science; tools for data science; the data science life cycle with a case study; algorithms for data science; data science research areas; and the future of data science.
A presentation that gives an overview about latest machine learning and deep learning techniques and use-cases that are prevalent in the eCommerce industry
This document summarizes a presentation given by Revolution Analytics on using R for marketing analytics. It discusses challenges like needing to make decisions faster based on more data and predictive models. It provides examples of companies using Revolution's R software to improve results, such as increasing lift for a client by 14% and saving another $270k. The presentation promotes Revolution's R software for handling big data and analytics faster through techniques like parallel processing and distributed computing. It argues Revolution R is the leading commercial provider of high performance R software.
This document provides an overview of data science and its applications. It discusses:
1) Industries that are being disrupted by data science like telecom, banking, retail, and healthcare.
2) How companies like Amazon, Netflix, and Google were able to disrupt their industries through their ability to analyze patterns in data faster than competitors.
3) The factors driving more companies to adopt data science including competitive advantages, revenue growth, and cost optimization.
Graphs & the Police: How Law Enforcement Analyze Connected Data at Scale - Neo4j
Law enforcement agencies are trailblazers in using graph analysis to understand connections. Manual and partially automated link analysis tools have played a crucial role in investigations and situational awareness for several decades.
Meanwhile, the global explosion in data volumes and sources hasn't been limited to the private sector. Law Enforcement agencies, departments and fusion centers use a vast array of databases and sources, including Record Management Systems (RMS), Computer Aided Dispatch (CAD) and countless other sources.
In this webinar, Christian Miles of Cambridge Intelligence (makers of KeyLines) will introduce the benefits of graph technologies for law enforcement. He will show how to use Neo4j with compelling graph visualization techniques to improve performance and analytics when working with large volumes of law enforcement data.
The document provides an overview of data science applications and use cases. It defines data science as using computer science, statistics, machine learning and other techniques to analyze data and create data products to help businesses make better decisions. It discusses big data challenges, the differences between data science and software engineering, and key areas of data science competence including data analytics, engineering, domain expertise and data management. Finally, it outlines several common data science applications and use cases such as recommender systems, credit scoring, dynamic pricing, customer churn analysis and fraud detection with examples of how each works and real world cases.
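As a sketch of the first of those use cases, a recommender system can be as simple as item-to-item cosine similarity over a purchase matrix. The users, items, and purchases below are hypothetical, and real systems use far larger sparse matrices:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows = users, columns = items (1 = purchased); invented example data
matrix = {
    "alice": [1, 1, 0, 0],
    "bob":   [1, 1, 1, 0],
    "carol": [0, 0, 1, 1],
}
items = ["bread", "butter", "beer", "chips"]

# Item vectors: one column per item, taken across all users
cols = list(zip(*matrix.values()))

def recommend(user, top_n=1):
    """Score each unseen item by its similarity to items the user already has."""
    owned = [i for i, x in enumerate(matrix[user]) if x]
    scores = {}
    for j in range(len(items)):
        if matrix[user][j]:
            continue  # skip items the user already bought
        scores[items[j]] = sum(cosine(cols[i], cols[j]) for i in owned)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

For "alice" (who bought bread and butter), the top recommendation is "beer", because bob bought it alongside the same two items; this is the "customers who bought X also bought Y" pattern.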
The document discusses big data analytics, including its characteristics, tools, and applications. It defines big data analytics as the application of advanced analytics techniques to large datasets. Big data is characterized by its volume, variety, and velocity. New tools and methods are needed to store, manage, and analyze big data. The document reviews different big data storage, processing, and analytics tools and methods that can be applied in decision making.
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im... - Ilkay Altintas, Ph.D.
The new era of data science is here. Our lives and society are continuously transformed by our ability to collect data systematically and turn it into value. The opportunities created by this change come with challenges that push for new and innovative data management and analytical methods, and for translating those methods into applications across science, society, and education. Collaboration, and the ability of multi-disciplinary teams to work together and bring the best of their knowledge in business, data, and computing, is vital for impactful solutions. This talk discusses a reference ecosystem and question-driven methodology, called PPODS, for building impactful data science applications in many fields, with specific examples in hazards, smart cities, and biomedical research.
The document outlines the typical lifecycle of a data science project, including business requirements, data acquisition, data preparation, hypothesis and modeling, evaluation and interpretation, and deployment. It discusses collecting data from various sources, cleaning and integrating data in the preparation stage, selecting and engineering features, building and validating models, and ultimately deploying results.
The document discusses the field of data mining. It begins by defining data mining and describing its branches including classification, clustering, and association rule mining. It then discusses the growth of data in various domains that has created opportunities for data mining applications. The document outlines the history and development of data mining from empirical science to computational science to data science. It provides examples of data mining applications in various domains like healthcare, energy, climate science, and agriculture. Finally, it discusses future directions and challenges for the field of data mining.
An introduction to data science: from the origin of the idea, through changing trends and the technologies that enabled it, to the applications already in real-world use today.
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science - Ferdin Joe John Joseph, PhD
This document discusses tools and technologies used in data science. It covers popular programming languages like Python, R, Java and C++. It also discusses databases, data analytics tools, APIs, servers, and frameworks. Specific tools mentioned include Hadoop, Spark, Tableau, IBM SPSS, SAS, and Excel. The document provides brief descriptions and examples of how these various tools are used in data science.
This document provides an overview of data science including:
- Definitions of data science and the motivations for its increasing importance due to factors like big data, cloud computing, and the internet of things.
- The key skills required of data scientists and an overview of the data science process.
- Descriptions of different types of databases like relational, NoSQL, and data warehouses versus data lakes.
- An introduction to machine learning, data mining, and data visualization.
- Details on courses for learning data science.
A brief introduction to data science and machine learning, with an emphasis on application scenarios from the traditional to the most innovative. The overview covers the basic definition of data science, an overview of machine learning, and examples across traditional scenarios, recommender systems and social network analysis, IoT, and deep learning.
This document provides an overview of a data science course. It discusses topics like big data, data science components, use cases, Hadoop, R, and machine learning. The course objectives are to understand big data challenges, implement big data solutions, learn about data science components and prospects, analyze use cases using R and Hadoop, and understand machine learning concepts. The document outlines the topics that will be covered each day of the course including big data scenarios, introduction to data science, types of data scientists, and more.
This document provides an introduction to data science and analytics. It discusses why data science jobs are in high demand, what skills are needed for these roles, and common types of analytics including descriptive, predictive, and prescriptive. It also covers topics like machine learning, big data, structured vs unstructured data, and examples of companies that utilize data and analytics like Amazon and Facebook. The document is intended to explain key concepts in data science and why attending a talk on this topic would be beneficial.
Big Data: The 6 Key Skills Every Business Needs - Bernard Marr
Here are the 6 most important skills businesses require to address their big data needs. It is based on this blog post http://ow.ly/EQUhb by Bernard Marr.
Many think that data science is like a Kaggle competition. There are, however, big differences in approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected performance in production.
This presentation was prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, or Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com.
Big Data Analytics.pdf - VijayKaran7
This document discusses big data analytics in business management. It defines big data and its key characteristics of volume, velocity, and variety. Large amounts of structured, unstructured, and semi-structured data are now being generated from various sources like web data, purchases, social media, sensors, etc. Big data analytics uses techniques like MapReduce and Hadoop to extract insights from large, diverse datasets in real-time to improve areas like customer experience, operations, and decision making for businesses. Location analytics and web analytics are also highlighted as important applications of big data.
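The MapReduce idea mentioned above can be sketched in a few lines of single-machine Python. Frameworks like Hadoop run exactly these map, shuffle, and reduce phases, but distributed across a cluster; the two documents below are invented for the example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(w.lower(), 1) for w in doc.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data needs big tools", "data drives decisions"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

The appeal of the model is that map and reduce are independent per key, so the framework can run them on many machines in parallel with no change to this logic.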
Susan Etlinger is an industry analyst who focuses on data and analytics. She has authored two reports on social media ROI and social analytics. She advises clients on measurement strategies and extracting insights from social data. She also works with technology companies to refine their strategies.
Big data refers to very large data sets that cannot be analyzed using traditional techniques. It is characterized by volume, velocity, and variety. Analyzing big data can help solve problems and generate value. The amount of data is growing exponentially from various sources like customer transactions, photos, and genome sequencing. This growth is driving changes in analytics approaches and capabilities.
Data science brings together techniques from computer science, statistics, mathematics, and domain knowledge to extract insights from data.
Predictive analysis and usage in procurement ppt 2017 - Prashant Bhatmule
Predictive analytics can help reduce volatility and improve decision making in procurement processes. It helps organizations understand future costs, demand, and supply, and so overcome challenges. Predictive models analyze past data and behaviors to forecast trends and outcomes. As data sources such as IoT sensors expand, predictive analytics is increasingly used for applications like manufacturing process improvement, predictive maintenance of equipment, and optimization of building energy usage.
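A minimal sketch of that forecasting idea: fit a linear trend to past figures by ordinary least squares and extrapolate one period ahead. The monthly spend series below is invented, and real procurement forecasts would account for seasonality and uncertainty:

```python
def linear_forecast(series, ahead=1):
    """Fit y = a + b*t by ordinary least squares over t = 0..n-1,
    then extrapolate `ahead` periods past the last observation."""
    n = len(series)
    t = list(range(n))
    mean_t = sum(t) / n
    mean_y = sum(series) / n
    # Slope: covariance(t, y) / variance(t)
    b = (sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, series))
         / sum((ti - mean_t) ** 2 for ti in t))
    a = mean_y - b * mean_t
    return a + b * (n - 1 + ahead)

# Hypothetical monthly procurement spend (thousands)
spend = [100.0, 104.0, 108.0, 112.0]
next_month = linear_forecast(spend)
```

On this perfectly linear series the fitted slope is 4 per month, so the one-month-ahead forecast is 116.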
Building the Artificially Intelligent Enterprise - Databricks
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited and specializes in business intelligence/analytics and data management. He discusses building the artificially intelligent enterprise and transitioning to a self-learning enterprise. Some key challenges discussed include the siloed and fractured nature of current data and analytics efforts, with many tools and scripts in use without integration. He advocates sorting out the data foundation, implementing DataOps and MLOps, creating a data and analytics marketplace, and integrating analytics into business processes to drive value from AI.
Big data is generated from a variety of sources like web data, purchases, social networks, sensors, and IoT devices. Telecom companies process exabytes and zettabytes of data daily, including call detail records, network configuration data, and customer information. This big data is analyzed to enhance customer experience through personalization, predict churn, and optimize networks. Analytics also helps with operations, data monetization through services, and identifying new revenue streams from IoT and M2M data. Frameworks like Hadoop and MapReduce are used to analyze this big data across clusters in a distributed manner for faster insights.
Ch1-Introduction to Business Intelligence.pptx - sommaikhantong
The document discusses business intelligence systems (BIS). It defines BIS as an analytical information system built on a data warehouse that uses tools like multidimensional analysis and data mining. The main components of BIS are the data warehouse, business analytics tools, business performance management, and user interfaces. BIS applications include accounting, inventory control, production management, and human resources. The document also discusses data warehousing, business analytics tools, and how technology changes have enabled more widespread use of BI.
Big data analytics (BDA) involves examining large, diverse datasets to uncover hidden patterns, correlations, trends, and insights. BDA helps organizations gain a competitive advantage by extracting insights from data to make faster, more informed decisions. It supports a 360-degree view of customers by analyzing both structured and unstructured data sources like clickstream data. Businesses can leverage techniques like machine learning, predictive analytics, and natural language processing on existing and new data sources. BDA requires close collaboration between IT, business users, and data scientists to process and analyze large datasets beyond typical storage and processing capabilities.
AI as a booster for your business: principles, use cases & ideation - Scaleway
This document provides an overview of artificial intelligence and machine learning techniques including supervised and unsupervised learning using statistics on large datasets. It discusses applications such as recognizing text, clustering individuals/concepts, natural language generation, signal processing, product recommendations, deep learning using neural networks, challenges around business process modeling, data availability, explainability, and solutions including specialized hardware and frameworks. It also covers costs associated with AI hardware and examples of training duration and expenses.
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI... - Matt Stubbs
Date: 14th November 2018
Location: Governance and MDM Theatre
Time: 10:30 - 11:00
Speaker: Mike Ferguson
Organisation: IBS
About: For most organisations today, data complexity has increased rapidly. In the area of operations, we now have cloud and on-premises OLTP systems with customers, partners and suppliers accessing these applications via APIs and mobile apps. In the area of analytics, we now have data warehouse, data marts, big data Hadoop systems, NoSQL databases, streaming data platforms, cloud storage, cloud data warehouses, and IoT-generated data being created at the edge. Also, the number of data sources is exploding as companies ingest more and more external data such as weather and open government data. Silos have also appeared everywhere as business users are buying in self-service data preparation tools without consideration for how these tools integrate with what IT is using to integrate data. Yet new regulations are demanding that we do a better job of governing data, and business executives are demanding more agility to remain competitive in a digital economy. So how can companies remain agile, reduce cost and reduce the time-to-value when data complexity is on the up?
In this session, Mike will discuss how companies can create an information supply chain to manufacture business-ready data and analytics to reduce time to value and improve agility while also getting data under control.
Intro to Artificial Intelligence w/ Target's Director of PM - Product School
This document provides an overview of artificial intelligence trends presented by Aarthi Srinivasan, Director of Product Management. It discusses growing investments in AI startups and by large corporations, with focus on automotive, healthcare, finance and education. Examples of applications include disease diagnosis, drug discovery, autonomous vehicles, facial and voice recognition. The presentation also provides guidance on structuring an AI product team and creating a machine learning-backed product vision.
ADV Slides: Data Curation for Artificial Intelligence StrategiesDATAVERSITY
This webinar will focus on the promise AI holds for organizations of every industry and size, and how to overcome some of today's challenges in preparing the organization for AI and planning AI applications.
The foundation for AI is data. You must have enough data to analyze and build models. Your data determines the depth of AI you can achieve — for example, statistical modeling, machine learning, or deep learning — and its accuracy. The increased availability of data is the single biggest contributor to the uptake in AI where it is thriving. Indeed, data’s highest use in the organization soon will be training algorithms. AI is providing a powerful foundation for impending competitive advantage and business disruption.
The document discusses key concepts related to data science including:
- The difference between data and information
- An overview of the volume, velocity, variety, and veracity (4 V's) of big data
- The steps involved in data analytics from data collection to model building
- Applications of data science in various industries like retail, manufacturing, and insurance
Business intelligence (BI) enables businesses to make fact-based decisions by aggregating data from various sources, enriching it with context, and presenting it in a way that informs decisions. BI technologies are becoming increasingly important and include predictive analytics, visualization, mobile BI, and cloud computing. Advanced analytics can uncover patterns in large datasets to predict customer behavior and business outcomes.
Big data comes from a variety of sources and in different formats. It is characterized by its volume, velocity, and variety. Organizations are using big data to gain business insights through analytics. This allows them to increase revenue, reduce costs, optimize processes, and manage risks. Examples of big data uses include marketing campaign analysis, customer segmentation, and fraud detection. Companies must overcome technological and organizational challenges to successfully leverage big data.
Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...Innovation Enterprise
Xerox presented on using big data and analytics to solve real-world problems. They discussed using transportation fare collection data to build models that infer passenger travel patterns and populate city dashboards. They also discussed working with educators to use student assessment data to provide real-time reports and recommendations to tailor instruction. Finally, they presented on using social media data and analytics to transform customer care services by identifying issues, engaging customers, and measuring engagement effectiveness.
S ba0881 big-data-use-cases-pearson-edge2015-v7Tony Pearson
IBM is a market leader in big data and analytics solutions. This session explains the basics of Big Data, with actual use cases of clients who have benefited from IBM solutions in this space, followed by architectures with IBM BigInsights, BigSQL, Platform Symphony and Spectrum Scale.
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
Getting from raw data to deploying data-driven solutions requires technology, data, and people. All of which exist. So why aren’t we seeing more truly data-driven companies: what's missing and why? During Strata Hadoop World Singapore 2015, Pauline Brown, Director of Marketing at Dataiku, explains how lack of collaboration is what is keeping companies from building and deploying data products effectively. Learn more about Dataiku and Data Science Studio: www.dataiku.com
1. Data Science Company
Introduction to (big) data science
Infofarm - Seminar
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
30/09/2014
2. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Agenda
• About us
• What is Data Science?
• Data Science in practice
– Models
– Tools
• Case study
3. About us
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
4. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
InfoFarm - Company
• Data Science and BigData startup
• Part of the Cronos group
– Largest independent IT services supplier in Belgium
– Organized in limited-sized highly focused competence centers
– 3000+ Consultants
• Incubated at Xplore Group, within the context of:
– Java
– PHP
– e-commerce (Hybris, Intershop, Magento, DrupalCommerce, ...)
– Mobile development (iOS, Android, ...)
– Web development (HTML5, CSS3, ...)
5. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
InfoFarm - Team
• Mixed skills team
– 2 Data Scientists
• Mathematics
• Statistics
– 4 BigData Consultants
– 1 Infra specialist
– n Cronos colleagues
with various background
• Certifications
– CCDH - Cloudera Certified Hadoop Developer
– CCAD - Cloudera Certified Hadoop Administrator
– OCJP – Oracle Certified Java Programmer
6. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
InfoFarm - Focus
• Mission
– “Help our customers to excel in their business activities by
providing them with new information and insights of high
business value.
Identifying, extracting and using data of all types and origins;
exploring, correlating and using it in new and innovative ways in
order to extract meaning and business value from it.”
• Focus Domains
– Data Science
– Machine Learning
– Big Data
7. Introduction: what is Data Science?
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
8. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
What is Data Science?
• Data Science & Business decisions
• Data Science vs …
– Statistics
– Business Intelligence
– Big Data
• What can Data Science do for your business?
• The Data Science maturity model
9. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Business decisions
• Any business requires continuous decision taking
– Will we offer this customer a discount or not?
– Do we need to keep extra stock for product X?
– How do we answer this customer question?
– At which supplier do we buy this product?
– With which solution will we respond to this RFP?
– Do we need to replace device X?
– …
• The possible answers to these questions are based on prior
experience with the business
• Each decision can turn out to be the right or wrong one; business
knowledge should help avoid picking the wrong ones
10. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Business decisions
– However …
• Do you really know your business that well?
• Hasn’t it evolved in this fast-changing world?
• Are you sure your competitors aren’t making better decisions?
– You probably own a lot more information than you might realize!
• All your business processes are generating data which you can
use to your advantage!
• Quotes you made vs deals you won
• Historical sales records
• Web logs showing user activity
• Social media activity referring to your brand/product
• Metering info on devices (internet of things)
• …
11. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Types of Data
– Proprietary data
• ERP, CRM, Orders, Customers, Products, etc…
– “Dark Data” – currently unused, maybe not even aware of
• Unknown, but present in the company
• Cost-efficient BigData tools might enable business cases using this data
– External data
• Websites, social media, open data, …
– Data still to be captured
• “If only we knew X or Y” …
– There might be a huge added value in “mashing up” proprietary
data with public/open data!
12. Business Knowledge vs Data Science
(Intuitive knowledge vs data driven decisions)
Business Knowledge
Acquired by experience
(assumed) insights
RISK: too much bias towards past experience and gut feeling
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science
Complementary to business knowledge
Confirmative or new insights
Data-driven decision taking
RISK: too naive a data interpretation,
disconnected from business
13. Business Knowledge vs Data Science
(Intuitive knowledge vs data driven decisions)
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
14. Business decisions: marketing example
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Example:
We want to send mailings about our new product
• Decisions to take:
– Which mail to send to which customers?
– We need customer segmentation!
• Risks in failing to do this correctly
– Missing opportunities (not informing customers)
– Annoying customers with irrelevant mailings (churn, reputation damage, …)
15. Business decisions: marketing example
• Business knowledge based approach
– “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female”
– But is this (still) true?
– E.g.: do we really want to send an ad for the new iPhone to a long-time
Android user because he’s a 30-something male customer?
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
16. Business decisions: marketing example
• Data-driven approach: Can we identify different segments automatically?
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
(machine learning!)
– WEB SERVER LOGS
Which customers have already looked at similar
product on our website?
– ORDER HISTORY
Which customers own complementary products?
– CRM INFORMATION
What is the typical profile of a customer that clicked
through on the last e-mail campaign for a similar product?
– …
• Business knowledge and Data Science become in- and output for
each other!
– Ideas/hypotheses and data to be examined should be identified from business
knowledge!
– A/B testing can be applied to test approaches and check results
– Let the data talk for itself! New business insights are generated
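The A/B testing step above can be sketched with a two-proportion z-test; the mailing numbers below are invented for illustration:

```python
import math

# Hypothetical A/B test: two mailing variants, conversions out of recipients sent
conv_a, n_a = 120, 2000   # variant A: 6.0% conversion
conv_b, n_b = 165, 2000   # variant B: 8.25% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled conversion rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                          # |z| > 1.96 ≈ significant at 5%
print(round(z, 2))  # → 2.77
```

Here variant B's lift is very unlikely to be chance, so it is safe to roll it out.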
17. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Being a Data Scientist
• “Data Scientist – the sexiest job of the 21st century”
- Thomas H. Davenport
• Data Scientist: “A person who is better at statistics than any software
engineer and better at software engineering than any statistician”
- Josh Wills
18. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science = team work!
19. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science vs Statistics
• Basic Statistics concepts
– Reliability and validity
– Probability
– Descriptive statistics and graphics
• Inferential statistics (and hypothesis testing)
– Probability distributions
– Populations and samples
– Confidence intervals
– Correlation
• Data Science
– Link with IT (tooling, scale, …)
– Data preparation & hacking (get data from databases, websites, …)
– Machine learning and automation
– Working interactively together with business
20. Data Science vs Business Intelligence
• Basic BI concepts: structuring data to report and query upon it
– DWH, OLAP, ETL processes
– Star- and snowflake schemas
– Query-oriented architectures
– Close to typical IT development cycle
• Data Science: working and experimenting with data to gain insights
– Exploratory working
– Work in a research cycle rather than development cycle
– Limited investment towards analysis that might or might not deliver
– Tools designed to avoid heavy ETL (loosely structured data)
– Eventually valuable analyses can be ported to BI systems
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
21. Data Science vs Business Intelligence
• Using tools that are designed to support exploratory
working
– Not requiring strict up-front schema design
– Allowing fast and cheap hypotheses testing
– Open up opportunities to quickly integrate many data sources
• Excel files, Text files, Word Documents
• Log files
• Relational databases
• Sensor data
• Timeseries data
• ...
• Integrations with online (OLTP) and analytical
(OLAP/BI) systems
– Typically for automating repetitive analysis and reporting outputs
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
22. Data Science vs Big Data
[Diagram: statistical inference as Sampling and Induction]
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Process of statistical inference: sampling & induction
• BigData allows:
– N=ALL (avoid sampling errors)
• Sampling issues can be overcome by just processing ALL available data (process massive data)
– N=1 (avoid issues with non-homogenous datasets)
• Categorization becomes true personalisation: project towards ONE individual (calculate per item)
• Significance considerations are not applicable!
23. What can Data Science do for your business?
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Extract meaning from data
– Using and combining data in ways it has never been used before
– Finding patterns and correlations in data from all possible sources
– Detecting anomalies and changes in known patterns
• Transform data of various types into valuable information
– As a basis for management decisions
– As a basis for data products
– That can improve your business in any way
• Build and integrate Data Products
– Recommendation engines, Prediction models, Automated classification, …
• The key point is spotting opportunities to outperform your
competitors using any data available!
24. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Scientific cycle
[Diagram: scientific cycle: Question → Hypothesis → Experiment (data) → Analyse results → Conclusion]
• This is NOT a development cycle!
• Experimentation vs engineering
• Because it is a science, the outcome cannot be predicted
• This makes it hard to integrate in an IT development process
25. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Scientific cycle
• Take small steps
• Formulate hypotheses
• Actually build things
• Apply A/B testing
• Even without success,
you learned something!
26. The Data Science maturity model
• Don’t run before you can walk: The Data Science Maturity model
Each level builds on the quality of the underlying step. It’s science, not magic …
– Start off by simply collecting the data you need (type, quantity, quality)
– Then report on your current business (confirmative analysis)
– Discover new and valuable information (exploratory analysis)
– Build and test prediction models (predictive analysis)
– Steer your business based on the advice output by your predictions (data-driven)
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
[Diagram: maturity pyramid: Collect → Describe → Discover → Predict → Advise]
27. The Data Science maturity model
Phase: Actions / Examples in commerce
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Collect: Logging information; gathering data from different sources / Logging user actions on a website; using loyalty cards to identify customers
Describe: Explorative data analysis; basic analytical functions / Checking quantity and quality of data; typical reporting; correlating data over sources
Discover: Finding correlations; building models / Finding similarly behaving customers
Predict: Building prediction models; formulating expectations for the future based on past info / Predict sales figures for a new product; predict whether a certain customer will or will not buy a certain product
Advise: Use prediction models to evaluate decision possibilities and pick the best / Target advertising to the right customer groups to optimize revenue
28. Data Science in practice
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
35. Modeling methods & statistics
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Basic patterns
– Recommendations
Based on known taste, propose items that might be liked as well
– Clustering
Detecting correlation groups in data without using pre-defined
segmentation based on business knowledge
– Classification
Automated labeling, acceptance/rejection of data based on
probability models
• Supervised & unsupervised learning methods
– k-means, Naive Bayes, k-nearest neighbours, random forests,
logistic regression, Apriori, ...
36. Modeling methods: Decision Tree
• Query: which kind of fruit am I looking at?
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
– More general: image recognition
• Clean your data
– What to do with missing values?
• Insert average value
• Insert special value
• Delete data
– What to do with outliers?
• Wrong data?
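The missing-value strategies listed above can be sketched in plain Python (the measurement values are made up; in practice a library such as pandas would do this):

```python
# Toy measurement column with a missing value (None); numbers are made up
values = [4.1, None, 3.8, 4.4]

known = [v for v in values if v is not None]
mean = round(sum(known) / len(known), 2)   # average of the known values

imputed = [mean if v is None else v for v in values]    # insert average value
sentinel = [-1.0 if v is None else v for v in values]   # insert special value
dropped = known                                         # delete incomplete data

print(imputed)  # → [4.1, 4.1, 3.8, 4.4]
```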
37. Modeling methods: Decision Tree
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Find most decisive variable
– Categorical variable: One leaf for each variable or one leaf for a
group of categories
– Numerical variable: find best cut-off(s)
[Diagram: decision tree root: Color splits into Green, Yellow and Red]
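"Find the most decisive variable" is typically done with information gain (the ID3 criterion). A minimal sketch, using the classic PlayTennis table that also appears in the Naive Bayes slides further on:

```python
import math
from collections import Counter

# Classic PlayTennis data; each row: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_idx):
    # Entropy of the labels minus the weighted entropy after splitting on the attribute
    gain = entropy([r[-1] for r in rows])
    for value in {r[attr_idx] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_idx] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

best = max(range(len(ATTRS)), key=lambda i: info_gain(data, i))
print(ATTRS[best])  # → Outlook
```

Splitting on the best attribute and repeating per leaf is exactly the loop the slides describe.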
38. Modeling methods: Decision Tree
• For each leaf, repeat the process:
Size is actually numerical: find size cut-offs
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
[Diagram: second level of the tree: Green → Size (Big/Medium/Small), Yellow → Shape (Round/Thin), Red → Size (Medium/Small)]
39. Modeling methods: Decision Tree
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
[Diagram: completed tree: Green → Size: Big = Watermelon, Medium = Green apple, Small = Grapes; Yellow → Shape: Round → Size: Big = Grapefruit, Medium = Lemon; Thin = Banana; Red → Size: Medium = Red apple, Small → taste: Sweet = Cherry, Sour = Grape]
40. Modeling methods: Decision Tree - Distributed
• A big advantage of big data tools is distributed
processing power (running processes in parallel)
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Build your decision tree
– Each leaf can be processed by another node
– All your data should still be available to every mapper
• Upgrading your decision tree
– Bagging trees (sampling your data)
– Random Forest (sampling your variables)
– Every mapper should only read a part of your data
– Still generally better results than a single decision tree
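The bagging idea above can be sketched with a stand-in learner; a majority-vote stub replaces the real tree learner here, and the toy rows are invented:

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical labelled rows; features don't matter for this majority-vote stub
rows = [({"color": "Green"}, "apple")] * 6 + [({"color": "Red"}, "cherry")] * 4

def train_stub(sample):
    """Stand-in for the tree learner: always predicts the sample's majority label."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda features: majority

# Bagging: draw a bootstrap sample of the rows for each model...
models = [train_stub(random.choices(rows, k=len(rows))) for _ in range(25)]

# ...and combine the 25 predictions by majority vote
votes = Counter(m({"color": "Green"}) for m in models)
print(votes.most_common(1)[0][0])
```

In a distributed setting each bootstrap sample (and thus each model) can be built by a different mapper.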
41. Modeling methods: Decision Tree
• QUESTION: Can we predict whether a customer will place an order during this web session?
[Diagram: fitted decision tree on session variables: Date_added > 1.5 → Hour_added > 16.29 → leaf 0.06; Date_added < 5.113 → leaves 0.1136 and 0.1829; remaining leaf 0.3273]
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Modeling (data mining)
– Input: historical surfing information
– Decision tree algorithm
• Loop over historical data
• Find most decisive variable
• For each leaf, repeat
– Avoid overfitting!
• Runtime usage
– Pass current info in tree model
– Allow certain discounts to increase conversion?
– Put user on checkout or in-store after putting product in basket?
42. Modeling methods: Naive Bayes
• QUESTION: Will I play tennis today?
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Start with labeled data from the past
Again clean your data!
• Often used with plain text
• Assumes that each variable is independent of all others
• Named after Bayes’ rule (statistics)
43. Modeling methods: Naive Bayes
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
44. Modeling methods: Naive Bayes
• Consider PlayTennis problem and new instance
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
(sun, cool, high, strong)
45. Modeling methods: Naive Bayes
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Estimate parameters
– P(yes) = 9/14 P(no) = 5/14
– P(Wind=strong|yes) = 3/9
– P(Wind=strong|no) = 3/5
– …
• We have
P(y)P(sun|y)P(cool|y)P(high|y)P(strong|y) = 0.005
P(n)P(sun|n)P(cool|n)P(high|n)P(strong|n) = 0.021
• Therefore this new instance is classified as “no”
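The computation above can be reproduced exactly with fractions, using the counts from the PlayTennis table:

```python
from fractions import Fraction as F

# Class priors from the 14-day table: 9 Yes, 5 No
p_yes, p_no = F(9, 14), F(5, 14)

# P(class) * P(sun|class) * P(cool|class) * P(high|class) * P(strong|class)
yes = p_yes * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)
no  = p_no  * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)

print(round(float(yes), 3), round(float(no), 3))  # → 0.005 0.021
print("no" if no > yes else "yes")                # → no
```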
46. Modeling methods: Naive Bayes - distributed
• Vectorisation of training data (more or less a word count) can
easily be distributed:
– Each text to one mapper
– Even when dealing with a large text, cut it into pieces
– Every small block of data is read only once, by one mapper
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
• Vectorisation of your new instance
• The actual prediction is a multiplication of all conditional probabilities,
so the calculation of the prediction is also easy to distribute
47. Modeling methods: Naive Bayes
• QUESTION: Can we route incoming questions (free text) to the
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
right person/department?
• Modeling (data mining)
– Input: historical questions and the person/department that handled them
– Naive Bayes algorithm
• For each word or n-gram (2 or 3 words), count occurrences per document
• Words with a high frequency in a single document are very valuable
• Words used in only a small number of documents are very valuable
• Remove stopwords, generic words, etc.
• Runtime usage
– Vectorize the incoming document (which words/n-grams occur, and how many times?)
– Predict category based on comparison with historical documents
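A minimal, hypothetical sketch of such a routing classifier (toy data, an illustrative stopword list, and Laplace smoothing added so unseen words don't zero out a score):

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "is", "my", "i", "to"}  # illustrative list

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Hypothetical historical questions labelled with the handling department
history = [
    ("billing", "invoice is wrong"),
    ("billing", "question about my invoice amount"),
    ("it", "password reset fails"),
    ("it", "cannot login to the portal"),
]

word_counts = defaultdict(Counter)
dept_counts = Counter()
for dept, text in history:
    dept_counts[dept] += 1
    word_counts[dept].update(tokenize(text))

vocab = {w for c in word_counts.values() for w in c}

def route(question):
    """Score each department with log P(dept) + sum of log P(word|dept);
    Laplace smoothing keeps unseen words from zeroing out a score."""
    words = tokenize(question)
    best, best_score = None, float("-inf")
    for dept in dept_counts:
        total = sum(word_counts[dept].values())
        score = math.log(dept_counts[dept] / len(history))
        for w in words:
            score += math.log((word_counts[dept][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = dept, score
    return best

print(route("invoice amount incorrect"))  # → billing
```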
48. Modeling methods: k-means Clustering
• QUESTION: Which countries have the same type of food
consumption?
• Your data is not labeled!
• You define labels for your clusters after applying the clustering algorithm
• Choose the number of clusters you are expecting
– Try different numbers of clusters
– Run an algorithm to decide the optimal number of clusters
• Plot your final results mapped on your principal components
50. Modeling methods: k-means Clustering
• Define a distance metric: scale it so every variable counts as much
as every other variable
• Create random starting points (as many as clusters you expect)
• Assign each point to the closest center (or starting) point
• Calculate the center of each cluster
• Iterate the previous two steps
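The steps above can be sketched in a few lines of Python (toy data; plain squared Euclidean distance assumed as the metric):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means following the steps above: random starting centers,
    assign each point to the closest center, recompute centers, iterate."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # New center = mean of the cluster (keep the old one if a cluster is empty)
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated blobs should end up in separate clusters
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```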
53. Modeling methods: k-means Clustering
"cluster 1"
Country RedMeat Fish Fr.Veg
Albania 10.1 0.2 1.7
Bulgaria 7.8 1.2 4.2
Romania 6.2 1.0 2.8
Yugoslavia 4.4 0.6 3.2
"cluster 2"
Country RedMeat Fish Fr.Veg
Denmark 10.6 9.9 2.4
Finland 9.5 5.8 1.4
Norway 9.4 9.7 2.7
Sweden 9.9 7.5 2.0
"cluster 3"
Country RedMeat Fish Fr.Veg
Czechoslovakia 9.7 2.0 4.0
E Germany 8.4 5.4 3.6
Hungary 5.3 0.3 4.2
Poland 6.9 3.0 6.6
USSR 9.3 3.0 2.9
"cluster 4"
Country RedMeat Fish Fr.Veg
Austria 8.9 2.1 4.3
Belgium 13.5 4.5 4.0
France 18.0 5.7 6.5
Ireland 13.9 2.2 2.9
Netherlands 9.5 2.5 3.7
Switzerland 13.1 2.3 4.9
UK 17.4 4.3 3.3
W Germany 11.4 3.4 3.8
"cluster 5"
Country RedMeat Fish Fr.Veg
Greece 10.2 5.9 6.5
Italy 9.0 3.4 6.7
Portugal 6.2 14.2 7.9
Spain 7.1 7.0 7.2
54. Modeling methods: k-means Clustering - distributed
• Calculate per-variable distance contributions
– Every mapper only needs one variable
• Assigning points to clusters:
– All centers in distributed cache
– Rest of the data only read once by one mapper
– Calculate distances and assign to the closest center point
• Update center points
– One mapper for each cluster
55. Modeling methods: k-means Clustering
• QUESTION: Into which different segments can we split our
customer base?
• Modeling (data mining)
– Input: any information on the customers (CRM, ERP, Social Media, …)
– Very important to find columns to use (requires business knowledge to
formulate hypotheses!)
– K-means clustering algorithm
• Define a “distance” formula to calculate how close two customers are to
each other
• Define starting points for each cluster center
• Iterate and re-allocate customers to a cluster, move cluster centers
• Runtime usage
– Quickly check the cluster in which a new customer could be residing
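That runtime check is just a nearest-center lookup; a sketch with hypothetical cluster centers:

```python
def closest_cluster(customer, centers):
    """At runtime a new customer is simply assigned to the nearest
    existing cluster center (same distance formula as during training)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(customer, centers[i])))

# Hypothetical centers from a finished clustering run (e.g. spend, visits)
centers = [(100.0, 2.0), (1500.0, 12.0)]
print(closest_cluster((1400.0, 10.0), centers))  # → 1
```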
56. Modeling methods: Apriori
• QUESTION: Which books might be interesting for you, given
which books you have read?
• Modeling (data mining)
– Input: all titles of books someone has read
– Make sure the same book always has the same title (e.g. drop the edition
from the title)
– Apriori algorithm
• Make baskets of read books, labelled with the reader
• Identify commonly co-occurring books
• Tweak your recommendation rules:
– Choose a large enough support
– The confidence of recommendations can be calculated
– The bigger the lift, the more valuable your recommendation might be for the reader
• Runtime usage
– Check if a subset of the books occurs as the left-hand side of a rule
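For a toy set of reader baskets (illustrative data, not the deck's), support, confidence and lift of a candidate rule can be computed as follows:

```python
# Hypothetical baskets of books per reader
baskets = [
    {"Dune", "Dune Messiah"},
    {"Dune", "Dune Messiah", "Foundation"},
    {"Dune", "Foundation"},
    {"Foundation"},
    {"Dune", "Dune Messiah"},
]
n = len(baskets)

def support(itemset):
    """Fraction of baskets containing the whole itemset."""
    return sum(itemset <= b for b in baskets) / n

# Candidate rule: {Dune} -> {Dune Messiah}
lhs, rhs = {"Dune"}, {"Dune Messiah"}
sup = support(lhs | rhs)   # how often both occur together
conf = sup / support(lhs)  # P(rhs | lhs)
lift = conf / support(rhs) # > 1 means better than chance

print(round(sup, 2), round(conf, 2), round(lift, 2))  # → 0.6 0.75 1.25
```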
57. Modeling methods: Apriori
• Data consists of books bought online
• More than 40,000 users bought more than one book (users who bought
only one book are not useful for building the model)
• In total they bought more than 220,000 books
• Notice the permutations in the rules
• As you might expect, sequel books are bought together
59. Modeling methods: Apriori - distributed
• Make list of books bought together (training data)
– Similar to n-grams (Naïve Bayes)
– Each customer's basket is read only once, by one mapper
• Make recommendations
– Every mapper handles a number of rules
60. Modeling methods: Apriori
• QUESTION: Which ads can I show on a website?
• Modeling (data mining)
– Input: All visited links, all bought items, …
– Decide what you think is important: you want to show items others were
also interested in, items others also bought, …
– Apriori algorithm
• Find items which occur together
• Define the support, confidence and lift you want
• Runtime usage
– Check if a subset of the visited links occurs as the left-hand side of a rule
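The runtime check is a subset test against the mined rules; a sketch with hypothetical rules and page URLs:

```python
# Hypothetical rules mined offline: LHS frozenset -> (RHS item, confidence)
rules = {
    frozenset({"/laptops", "/laptops/acme-15"}): ("/accessories/mouse", 0.42),
    frozenset({"/books/dune"}): ("/books/dune-messiah", 0.75),
}

def matching_ads(visited):
    """Fire every rule whose left-hand side is a subset of the pages visited."""
    visited = set(visited)
    return [(rhs, conf) for lhs, (rhs, conf) in rules.items() if lhs <= visited]

print(matching_ads(["/home", "/books/dune", "/cart"]))  # → [('/books/dune-messiah', 0.75)]
```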
61. Case study