The document provides an overview of big data and social analytics, covering topics such as the definition of big data, machine learning, common big data tools like Hadoop and Spark, programming languages for data science like Python and R, and packages for machine learning in Python. It also discusses practical applications of big data and introduces exercises for hands-on practice with tools like NumPy in Jupyter notebooks.
This seminar discusses nursing informatics and the use of technology in research. It defines key terms like data, databases, and knowledge discovery in databases (KDD). It explains that KDD involves extracting hidden knowledge from large datasets through the use of data mining techniques. The seminar also outlines the various stages of the KDD process and discusses how technology can enhance research activities like data collection and analysis, communication and collaboration, and information storage and retrieval.
The document discusses big data processing systems. It begins with an overview of big data and its evolution due to technologies like IoT, social media, and smart cars. This has led to an exponential increase in data volume and variety, including structured, semi-structured and unstructured data. Traditional databases cannot handle this type and size of data. The document then introduces Hadoop as an open source framework to process large, diverse datasets across clusters. It uses HDFS for storage and MapReduce for parallel processing of data stored in HDFS. Hadoop provides scalable solutions to the problems of storing huge, growing datasets and processing complex, diverse data faster.
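The MapReduce model described above can be sketched in plain Python; this is a single-process simulation of the map, shuffle, and reduce phases (real Hadoop distributes each phase across a cluster), with invented sample input:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word, as a Hadoop mapper would.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle: group values by key; Reduce: sum the counts per word.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "hadoop processes big data"]
counts = reduce_phase(map_phase(lines))
print(counts["big"])   # 3
```

In a real deployment the input lines would live in HDFS blocks and many mapper and reducer processes would run in parallel, but the programming model is exactly this pair of functions.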
This document outlines a course on knowledge acquisition in decision making, including the course objectives of introducing data mining techniques and enhancing skills in applying tools like SAS Enterprise Miner and WEKA to solve problems. The course content is described, covering topics like the knowledge discovery process, predictive and descriptive modeling, and a project presentation. Evaluation includes assignments, case studies, and a final exam.
The document discusses data mining and knowledge discovery from large data sets. It begins by defining the terms data, information, knowledge, and wisdom in a hierarchy. It then discusses why data mining is needed due to the explosive growth of data from various sources. Data mining is defined as the non-trivial extraction of implicit and potentially useful knowledge from large data sets. The knowledge discovery process involves identifying a problem, mining data to transform it into actionable information, acting on the information, and measuring the results. The document outlines different types of data that can be mined, including structured, transactional, time-series, spatial, multimedia, and web data. Common data mining tasks are also described, such as classification, prediction, and clustering.
This document introduces data mining concepts and techniques. It defines data mining as the process of discovering interesting patterns from large amounts of data. The document outlines several data mining functionalities including classification, clustering, association rule mining, and outlier detection. It also discusses popular data mining algorithms, major issues in data mining, and provides a brief history of the data mining field and community.
Classification algorithms: decision tree, naive Bayes, backpropagation, and KNN — TU BIM 8th-semester Data Mining and Data Warehousing slides by Tekendra Nath Yogi
Data Scopes - Towards transparent data research in digital humanities (Digita...) — Marijn Koolen
Data scopes describe the process of data gathering, cleaning, and combining in digital humanities research, which is too often treated as mere preparation rather than research in its own right, and is mostly left undescribed in scholarly communications. We argue that scholars need to be more aware of the intellectual effort of this process and make it more transparent.
Chapter 5 — Data Mining: Concepts and Techniques, 2nd Ed. slides (Han & Kamber) — error007
The document discusses Chapter 5 from the book "Data Mining: Concepts and Techniques" which covers frequent pattern mining, association rule mining, and correlation analysis. It provides an overview of basic concepts such as frequent patterns and association rules. It also describes efficient algorithms for mining frequent itemsets such as Apriori and FP-growth, and discusses challenges and improvements to frequent pattern mining.
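As a hedged sketch of the Apriori idea the chapter covers — generate candidate itemsets level by level and prune them by minimum support — here is a minimal pure-Python version run on an invented toy transaction set:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose support count is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    frequent, current = {}, [frozenset([i]) for i in sorted(items)]
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets to form (k+1)-candidates.
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

txns = [{"milk", "bread"}, {"milk", "bread", "eggs"},
        {"bread", "eggs"}, {"milk", "eggs"}]
freq = apriori(txns, min_support=2)
print(freq[frozenset({"milk", "bread"})])   # 2
```

FP-growth reaches the same frequent itemsets without candidate generation by compressing the transactions into a prefix tree, which is why it typically scales better on dense data.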
Data Mining and Machine Learning Explained in Jargon-Free & Lucid Language — q-Maxim
Data mining and machine learning explained in jargon-free, lucid language. By reading it, one can gain some intuition about what data mining and machine learning are all about and apply it in their own work.
This document discusses the evolution of database technology and data mining. It provides a brief history of databases from the 1960s to the 2010s and their purposes over time. It then discusses the motivation for data mining, noting the explosion in data collection and need to extract useful knowledge from large databases. The rest of the document defines data mining, outlines the basic process, discusses common techniques like classification and clustering, and provides examples of data mining applications in industries like telecommunications, finance, and retail.
This document provides an overview of data mining concepts and techniques. It defines data mining as the extraction of interesting and useful patterns from large amounts of data. The document outlines several potential applications of data mining, including market analysis, risk analysis, and fraud detection. It also describes the typical steps involved in a data mining process, including data cleaning, pattern evaluation, and knowledge presentation. Finally, it discusses different data mining functionalities, such as classification, clustering, and association rule mining.
This document provides information about a course on data warehousing and data mining, including:
1. It outlines the course syllabus which covers the basics of data warehousing, data preprocessing, association rules, classification and clustering, and recent trends in data mining.
2. It describes the 5 units that make up the course, including an overview of the topics covered in each unit such as data warehouse architecture, data integration, decision trees, and applications of data mining.
3. It lists two textbooks and four references that will be used for the course.
Abstract: Knowledge has played a significant role in human activities since the earliest stages of human development. Data mining is the process of knowledge discovery in which knowledge is gained by analyzing data stored in very large repositories from various perspectives and summarizing the results into useful information. Because of the importance of extracting knowledge from large data repositories, data mining has become a vital branch of engineering, affecting human life in various spheres directly or indirectly. The purpose of this paper is to survey many of the future trends in the field of data mining, with a focus on those thought to have the most promise and applicability to future data mining applications.
Keywords: Current and Future of Data Mining; Data Mining; Data Mining Trends; Data Mining Applications.
1) Data mining involves extracting hidden patterns from large datasets to discover useful information. It is an interdisciplinary field drawing from statistics, machine learning, database technology and more.
2) The overall goal is to extract information and transform it into an understandable structure. This includes data cleaning, integration, selection, transformation, mining patterns, and evaluating/presenting the results.
3) Data mining is used for applications like market analysis, risk analysis, fraud detection and more, across domains like business, science, health, and society. It has the potential to provide insights from vast amounts of accumulated data.
Big Data Presentation — Data Center Dynamics Sydney 2014 — Dez Blanchfield
The document discusses the rise of big data and its impact on data centers. It defines what big data is and what it is not, providing examples of big data sources and uses. It also explores how the concept of a data center is evolving, as they must adapt to support new big data workloads. Traditional data center designs are no longer sufficient and distributed, modular, and software-defined approaches are needed to efficiently manage large and growing volumes of data.
The document provides an overview of the data mining concepts and techniques course offered at the University of Illinois at Urbana-Champaign. It discusses the motivation for data mining due to abundant data collection and the need for knowledge discovery. It also describes common data mining functionalities like classification, clustering, association rule mining and the most popular algorithms used.
Introduction to Data Mining Concepts and Techniques — Sơn Còm Nhom
This document provides an introduction to data mining techniques. It discusses data mining concepts like data preprocessing, analysis, and visualization. For data preprocessing, it describes techniques like similarity measures, down sampling, and dimension reduction. For data analysis, it explains clustering, classification, and regression methods. Specifically, it gives examples of k-means clustering and support vector machine classification. The goal of data mining is to retrieve hidden knowledge and rules from data.
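As a hedged sketch of the k-means clustering example the summary mentions, here is a minimal pure-Python version on invented 2-D points:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on 2-D points; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(pts, k=2))
print(centers)   # one centroid near (0.33, 0.33), one near (10.33, 10.33)
```

The two alternating steps — assign, then update — are the whole algorithm; production implementations mainly add smarter initialization and convergence checks.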
What is Data Mining? Which Algorithms Can Be Used for Data Mining? — Seval Çapraz
This presentation covers what data mining is and which techniques and algorithms are available for it, helping you understand the core concepts of data mining.
The document describes the automated construction of a large semantic network called SemNet. It analyzes a large text corpus to extract terms and relations using n-gram analysis, part-of-speech tagging, and pattern matching. SemNet contains over 2.7 million terms and 37.5 million relations. The document evaluates SemNet by comparing it to WordNet and ConceptNet, finding that it contains over 77% of WordNet synsets and over 82% of ConceptNet nouns.
This document provides an introduction to data mining and knowledge discovery. It discusses how the increasing amount of data collected has led to a "data flood" that makes it difficult for humans to analyze on their own. Data mining aims to find useful patterns in large datasets through techniques like classification, clustering, and association rule mining. The document outlines the data mining process and gives examples of data mining applications in domains such as search engines, marketing, biology, fraud detection, and security. It also addresses some controversies around using data mining for applications like threat detection.
How Data Science Can Grow Your Business? — Noam Cohen
What is data science?
How is it used in the industry?
DS methodology and life cycle
Who are the Data-team members?
Limitations and caveats
The document provides an overview of data mining concepts including association rules, classification, and clustering algorithms. It introduces data mining and knowledge discovery processes. Association rule mining aims to find relationships between variables in large datasets using the Apriori and FP-growth algorithms. Classification algorithms build a model to predict class membership for new records based on a decision tree. Clustering algorithms group similar records together without predefined classes.
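Once frequent itemsets have been found (by Apriori or FP-growth, as mentioned above), candidate rules are scored by support and confidence; a small sketch with invented transactions:

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimated P(consequent | antecedent) = support(A ∪ C) / support(A).
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

txns = [{"bread", "butter"}, {"bread", "butter", "jam"},
        {"bread", "jam"}, {"butter", "jam"}]
print(support({"bread", "butter"}, txns))        # 0.5
print(confidence({"bread"}, {"butter"}, txns))   # 2/3
```

A rule such as bread → butter is kept only when both numbers clear user-chosen thresholds, which is what separates interesting rules from coincidences.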
This document provides an outline and overview of key concepts related to data mining. It begins with an introduction to data mining and related tasks such as classification, clustering, and association rule mining. It then discusses concepts that are related to data mining, such as databases, information retrieval, and dimensional modeling. Finally, it outlines common data mining techniques including statistics, similarity measures, decision trees, and neural networks. The overall goal is to introduce the main components and approaches used in data mining.
Data mining techniques are used to analyze large datasets and discover hidden patterns. There are three main types of data mining techniques: supervised, unsupervised, and semi-supervised learning. Supervised learning uses labeled training data to learn relationships between inputs and outputs. Unsupervised learning looks for patterns in unlabeled data. Semi-supervised learning uses some labeled and mostly unlabeled data. The knowledge discovery in databases (KDD) process is a nine step method for applying data mining techniques which includes data selection, preprocessing, transformation, mining, and interpretation.
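To make the supervised-learning case concrete, here is a hedged sketch of a 1-nearest-neighbour classifier — the labeled training examples are invented, and the point is simply that labels in the training data drive the prediction:

```python
import math

def predict_1nn(train, query):
    """Supervised learning in miniature: label a query point with the
    label of its nearest labeled training example."""
    nearest = min(train, key=lambda ex: math.dist(ex[0], query))
    return nearest[1]

# Labeled training data: (feature vector, label) pairs.
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((8.0, 9.0), "large"), ((9.0, 8.5), "large")]

print(predict_1nn(train, (1.1, 0.9)))   # small
print(predict_1nn(train, (8.5, 9.0)))   # large
```

An unsupervised method given the same points without the "small"/"large" labels could still group them into two clusters, but it could not name the groups — that is the practical difference between the two settings.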
Data mining involves analyzing large amounts of data to discover patterns that can be used for purposes such as increasing sales, reducing costs, or detecting fraud. It allows companies to better understand customer behavior and develop more effective marketing strategies. Common data mining techniques used by retailers include loyalty programs to track purchasing patterns and target customers with personalized coupons. Data mining software uses techniques like classification, clustering, and prediction to analyze data from different perspectives and extract useful information and patterns.
This lecture gives various definitions of data mining and explains why data mining is required. Examples of classification, clustering, and association rules are given.
Big Data Tutorial — Marko Grobelnik — 25 May 2012
The document discusses big data, including what it is, key factors enabling its growth like increased storage and processing power, and techniques for handling big data like distributed processing and NoSQL databases. It provides examples of tools and applications for big data and discusses challenges like ensuring patterns found in big data analysis are actually meaningful.
An introduction to data science: from the very beginning of the data science idea, through the latest designs, changing trends, and enabling technologies, to the applications already in real-world use today.
The document provides a general introduction to artificial intelligence (AI), machine learning (ML), deep learning (DL), and data science (DS). It defines each term and describes their relationships. Key points include:
- AI is the ability of computers to mimic human cognition and intelligence.
- ML is an approach to achieve AI by having computers learn from data without being explicitly programmed.
- DL uses neural networks for ML, especially with unstructured data like images and text.
- DS involves extracting insights from data through scientific methods. It is a multidisciplinary field that uses techniques from ML, DL, and statistics.
This document discusses data science career paths and the role of a data scientist. It defines data science as the scientific process of transforming data into insights to make better decisions. Data scientists are skilled at statistics, software engineering, machine learning, and communicating findings. The document outlines common data science career paths, including roles in fraud detection and social media analytics. It also lists important skills for data scientists such as data mining, machine learning, statistics, visualization, programming, and working with big data. Finally, it provides an example of the tasks a data scientist might complete in a typical day.
Data and Analytics Career Paths, Presented at IEEE LYC'19.
About Speaker:
Ahmed Amr is a Data/Analytics Engineer at Rubikal, where he leads, develops, and runs daily data/analytics operations, including data ingestion, data streaming, data warehousing, and analytical dashboards. Ahmed graduated from the Computer Engineering Department, Alexandria University, and is currently pursuing his MSc degree in Computer Science at AAST. Professionally, Ahmed has worked with Egyptian/US startups such as Badr, Incorta, and WhoKnows to develop their data/analytics projects. Academically, Ahmed worked as a Teaching Assistant in the CS department, AAST. Ahmed helps software companies develop robust data engineering infrastructure and powerful analytical insights.
References:
1) https://www.datacamp.com/community/tutorials/data-science-industry-infographic
2) Analytics: The real-world use of big data, IBM, Executive Report
The document discusses data preprocessing for machine learning models in IoT applications. It covers:
1) The different types of data generated by IoT devices and sensors, including nominal, numeric, discrete, and continuous data.
2) The steps for data preprocessing - cleaning incomplete, noisy, and inconsistent data; integrating data from multiple sources; transforming data through normalization, aggregation, and dimensionality reduction; and reducing data volume.
3) The importance of data understanding, availability of data and computing resources, and the nature of the problem for selecting appropriate machine learning models.
4) Architectures for IoT machine learning, including conventional cloud-based approaches and newer TinyML approaches that optimize models for execution directly on resource-constrained edge devices.
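The cleaning and normalization steps listed above can be sketched in a few lines; the sensor readings are invented for illustration:

```python
def clean_and_normalize(readings):
    """Drop missing values, then min-max normalize to [0, 1] —
    a common preprocessing step before feeding sensor data to an ML model."""
    values = [v for v in readings if v is not None]   # cleaning: drop gaps
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]                  # constant signal
    return [(v - lo) / (hi - lo) for v in values]

sensor = [18.0, None, 22.0, 26.0, None, 20.0]
print(clean_and_normalize(sensor))   # [0.0, 0.5, 1.0, 0.25]
```

Real pipelines would typically impute rather than drop missing readings and fit the min/max on training data only, but the shape of the transformation is the same.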
Alexey Zinoviev presented this paper at the Second Thumbtack Technology Expert Day.
This paper covers the following topics: data mining, machine learning, Octave, and the R language.
YouTube: http://youtu.be/kGIP6XeWiaA
How can I become a data scientist? What are the most valuable skills to learn for a data scientist now? Could I learn how to be a data scientist by going through online tutorials? What does a data scientist do?
These are only some of the questions that are being discussed online, on blogs, on forums and on knowledge-sharing platforms like Quora.
Let me share the Beginner's Guide to Data Science, which will be really helpful to you.
Also check out: http://bit.ly/2Mub6xP
This document provides an introduction to data science. It discusses what data science is, the data life cycle, key domains that benefit from data science and why Python is well-suited for data science. It also summarizes several important Python libraries for data science - Pandas for data analysis, NumPy for scientific computing, Matplotlib and Seaborn for data visualization, and introduces machine learning concepts like supervised and unsupervised learning. Example algorithms like linear regression and K-means clustering are also covered.
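As an illustration of the linear-regression example mentioned, a least-squares fit with NumPy (the data points are invented and chosen to lie roughly on y = 2x):

```python
import numpy as np

# Fit y = a*x + b by least squares — the workhorse behind linear regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

A = np.vstack([x, np.ones_like(x)]).T      # design matrix: columns [x, 1]
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(round(a, 2), round(b, 2))   # slope ≈ 1.97, intercept ≈ 0.11
```

Libraries like scikit-learn wrap exactly this computation behind a `fit`/`predict` interface, which is why understanding the NumPy version pays off.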
This document provides an overview of big data and discusses key concepts. It begins by defining big data and noting the increasing volume, velocity and variety of data being created. It then covers the big data landscape including storage models and technologies like Hadoop, analytics techniques like machine learning, and visualization. Finally, it discusses business use cases and how big data is impacting industries and creating new business models through insights gained from data.
This document provides an overview of data analytics processes for learning and academic analytics projects. It discusses key data dimensions including computing, location, time, activity, physical conditions, resources, user attributes, and relations. It then covers applications of analytics for learners and teachers to monitor learning and improve performance. The document outlines the stages of an extract-transform-load data processing workflow. Finally, it discusses different methods for knowledge discovery including prediction, structure discovery through clustering and factor analysis, and relationship mining through association rules, correlations, sequential patterns and causal analysis.
The document provides an introduction to data science at scale and distributed thinking. It discusses the motivation for data science at scale due to increasing data volumes, varieties, and velocities. It distinguishes between data science, which focuses on accuracy, and data engineering, which focuses on scale, performance, and reliability. The document then provides a crash course on data engineering concepts like distributed computation and the SMACK stack. It introduces Spark as a framework that can scale data processing. Finally, it discusses probabilistic algorithms as an approach for processing large datasets that may be inexact but use less resources than exact algorithms.
The talk is on how to become a data scientist. It was given at the 2nd Annual Event of the Pune Developer's Community. It focuses on the skill set required to become a data scientist, and on what you can become based on who you are.
Data science is an interdisciplinary field that uses scientific methods to extract knowledge from data. It involves collecting, cleaning, analyzing and modeling data to discover useful insights. The data science process includes defining problems, collecting and preparing data, exploring and analyzing data, building models, and deploying models. Data comes in many forms like structured, unstructured, natural language, graphs and images. Descriptive statistics are used to summarize key aspects of data through measures of central tendency and variability.
Counting Unique Users in Real-Time: Here's a Challenge for You! (DataWorks Summit)
Finding the number of unique users out of 10 billion events per day is challenging. At this session, we're going to describe how re-architecting our data infrastructure, relying on Druid and ThetaSketch, enables our customers to obtain these insights in real-time.
To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. Specifically, we provide them with the ability to see the number of unique users who meet a given criterion.
Historically, we have used Elasticsearch to answer these types of questions, however, we have encountered major scaling and stability issues.
In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.
We will also provide guidelines and best practices with regards to Druid.
Topics include:
* The need and possible solutions
* Intro to Druid and ThetaSketch
* How we use Druid
* Guidelines and pitfalls
This talk provides an engineering perspective on privacy protection. The intended audience is architects, developers, data scientists, and engineering managers that build applications handling user data. We highlight topics that require attention at an early design stage, and go through pitfalls and potentially expensive architectural mistakes. We describe a number of technical patterns for complying with privacy regulations without sacrificing the ability to use data for product features. The content of the talk is based on real world experience from handling privacy protection in large scale data processing environments.
Concepts, use cases and principles to build big data systems (1) (Trieu Nguyen)
1) Introduction to the key Big Data concepts
1.1 The Origins of Big Data
1.2 What is Big Data ?
1.3 Why is Big Data So Important ?
1.4 How Is Big Data Used In Practice ?
2) Introduction to the key principles of Big Data Systems
2.1 How to design Data Pipeline in 6 steps
2.2 Using Lambda Architecture for big data processing
3) Practical case study : Chat bot with Video Recommendation Engine
4) FAQ for students
Introduction to machine learning and applications (1) (Manjunath Sindagi)
This document provides an introduction to machine learning including definitions, applications, and examples. It discusses the types of machine learning including supervised learning using examples of regression and classification. Unsupervised learning including clustering is also covered. The steps to solve a machine learning problem are outlined including feature selection, scaling, model selection, parameter selection, cost functions, gradient descent, and evaluation. Career opportunities in data science are discussed along with challenges such as data acquisition.
Dirty data? Clean it up! - Datapalooza Denver 2016 (Dan Lynn)
Dan Lynn (AgilData) & Patrick Russell (Craftsy) present on how to do data science in the real world. We discuss data cleansing, ETL, pipelines, hosting, and share several tools used in the industry.
1. Big Data & Social
Analytics
Gustavo Souto, M.Sc.
2. Summary
● Part One: Theory about Big Data and Analytics
○ About Data
○ What’s Big Data?
○ Machine Learning
○ Big Data tools
● Part Two: Practice
○ Languages for Data Science
○ Main python packages for Machine Learning
○ Let's get some practice
● Part Three: Conclusions
○ Let’s recap!
○ Next steps
○ References
3. About me
Gustavo Souto
I am a Ph.D. student at the Federal University of Rio Grande
do Norte (UFRN). I started my Ph.D. at the
Technische Universität Dortmund (TU Dortmund) in
Germany.
I also hold a Master's degree in Computer Engineering
from UFRN.
Topics of interest: machine learning, Big Data, data
streams, anomaly detection
4. About Data
Understanding the data
● How much data do we create every day?
○ About 2.5 quintillion bytes [1].
● How about in 1992?
○ About 100 GB per day.
● Let's check the table of data sizes to better understand these scales.
5. About Data
Name Size Example
Byte 8 bits A single character
Kilobyte 1000 Bytes A compressed document or image page (~50 KB)
Megabyte 1000 Kilobytes A digital book (~5 MB)
Gigabyte 1000 Megabytes One symphony recorded in HiFi
Terabyte 1000 Gigabytes An automated tape robot
Petabyte 1000 Terabytes All academic libraries in the EU (~2 PB)
Exabyte 1000 Petabytes All words ever spoken by humanity (~5 EB)
Zettabyte 1000 Exabytes
Yottabyte 1000 Zettabytes The current storage capacity of the Internet
6. About Data
Where do we find data?
● Text:
○ Documents and reports.
● Databases:
○ MySQL, PostgreSQL, Oracle etc.
● Geographic:
○ GPS and Maps.
● Social Media:
○ Facebook, Twitter, Instagram etc.
● Files:
○ JSON, CSV, XML etc.
● APIs:
● Images and videos
7. About Data
About internal structure of data
● Structured data
○ The data follows a well-defined structure, that is, it is organized into titled
columns and rows.
○ Examples: tables in MySQL, PostgreSQL, Oracle etc.
● Semistructured data
○ A form of structured data that lacks a strict data model [2].
○ Examples: emails, XML, JSON.
● Unstructured data
○ The data does not follow any predefined structure.
○ Examples: free-form text, comment fields, tweets.
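To make the distinction concrete, here is a small sketch (standard-library Python only; the records are invented) contrasting structured CSV rows, which share one fixed schema, with semi-structured JSON records, whose fields may vary:

```python
import csv
import io
import json

# Structured: rows that follow a fixed schema, as in a relational table.
csv_text = "id,name,age\n1,Ana,34\n2,Bruno,28\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON records with no strict, uniform schema.
records = [
    json.loads('{"id": 1, "name": "Ana", "age": 34}'),
    json.loads('{"id": 2, "name": "Bruno", "tags": ["vip"]}'),
]

print(rows[0]["name"])        # Ana
print(records[1].get("age"))  # None: this record has no "age" field
```

Unstructured data (free text, images) has no such per-field access at all and needs parsing or feature extraction first.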
8. What’s Big Data
The 4 V’s
● Volume
○ A massive volume of both structured and unstructured data.
● How much data is considered ‘Big Data’?
○ Terabytes, Petabytes, and so on.
○ Driscoll created a simple table that defines the boundaries [3].
9. What’s Big Data
The 4 V’s
● Variety
○ Try to capture all of the data that pertains to our decision-making process [4].
● Data complements
○ Analyze different sources to drive decisions.
○ Most of the data out there is semistructured or unstructured.
■ Example: Customer call center (voice classification + customer’s record data
+ transaction history).
“This is the third outage I’ve had in one week!”
10. What’s Big Data
The 4 V’s
● Velocity
○ The rate at which data arrives at the enterprise and is processed or well
understood [4].
● “How long does it take you to do something about it or know it has even
arrived?” [4]
○ One pass processing.
■ Example: An enterprise facing a network security problem.
○ Data stream
■ Example: Netflix, Youtube, Sensors.
○ Velocity is one of the most overlooked areas of Big Data.
11. What’s Big Data
The 4 V’s
● Veracity
○ It refers to the quality of data, or trustworthiness [4].
● Data transformations
○ Remove noise.
● Big spam
○ Untrustworthy information (noise).
○ Example: Some Tweets.
17. What’s Big Data
How about ethics?
● You are responsible for the data!
○ Be careful when you deal with them.
● Risks
○ (Think first!) Do the results put anyone at risk?
■ Understand the risks of your decisions.
● Personally Identifiable Information (PII)
○ Information that identifies one person from another.
○ The data must be anonymized.
○ Red flags: addresses, telephone numbers, geolocation, codes, slang.
18. What’s Big Data
Everything is connected
● Internet of Things (IoT)
○ It is an internetworking of physical devices,
vehicles, buildings, and other items embedded
with software, sensors, actuators, and network
connectivity that enables these objects to
collect and exchange data [5].
19. What’s Big Data
Business Intelligence (BI)
● Definition
○ It is a set of techniques and tools for the acquisition and transformation of raw data
into meaningful and useful information for business analysis purposes [6].
● Common functionalities
○ Reporting
○ Online analytical processing
○ Analytics
○ Data Mining
○ Process Mining
○ Complex Event Processing (CEP)
○ etc.
20. What’s Big Data
Business Intelligence (BI)
● Support a wide range of business decisions.
● BI Framework
○ Data Warehousing (DW)
■ It is constructed by integrating data from multiple heterogeneous sources that support
analytical reporting, structured and/or ad hoc queries, and decision making [7].
■ It is considered a core component of BI.
○ ETL’s (Extract - Transform - Load)
○ Analysis tools
21. What’s Big Data
Business Intelligence (BI)
● Online Analytical Processing (OLAP)
○ An approach that answers multi-dimensional analytical queries swiftly.
○ Encompasses: relational databases, report writing, and data mining.
○ Example applications: business reporting for sales, marketing, and management.
● Online Transactional Processing (OLTP)
○ A class of information systems that facilitates and manages transaction-oriented
applications; that is, it processes transactions rather than supporting BI or reporting.
○ Much simpler queries than OLAP, but in a much larger volume.
23. What’s Big Data
Business Intelligence (BI)
● Drawbacks
○ Tries to build perfect statistical models even after the underlying conditions have changed.
○ Describes the past; it does not predict the future.
○ Assumes that the state of the data is constant.
○ Poor support for video, audio, logs and other unstructured data.
24. What’s Big Data
Data Science Roles
● Data Scientist
○ They are experienced data professionals in their organization who can query and process
data, provide reports, summarize and visualize data [8].
● Data Engineer
○ They are the data professionals who prepare the “big data” infrastructure to be analyzed by
Data Scientists, that is, design, build, integrate data from various resources, and manage big
data [8].
● Business Intelligence Developers
○ They are data experts that interact more closely with internal stakeholders to understand the
reporting needs, and then to collect requirements, design, and build BI and reporting solutions
for the company [8].
25. What’s Big Data
Think about Big Data Problem
● Task: come up with a problem (Big Data)
○ Time: 20 minutes
○ We together discuss each problem.
■ Explain your problem in few lines.
■ Explain why you think Big Data might be a good solution for it.
○ Material:
■ Post it and pens
26. Machine Learning
What’s Machine Learning?
● It gives computers the ability to learn without being explicitly programmed. [9]
● A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P if its performance at tasks in T,
as measured by P, improves with experience E. [10]
27. Machine Learning
Data Model
● It organizes the data elements and standardizes how these elements relate to
one another.
○ Algorithm builds a model from sample inputs.
● Applications:
○ Spam filtering, Search engines, computer vision, and others.
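As a toy illustration of Mitchell's T/E/P definition applied to the first application listed, spam filtering: task T is classifying messages, experience E is a set of labelled messages, and performance P is accuracy. The sketch below (invented data, naive word-counting rather than any standard algorithm) builds its model purely from the examples:

```python
from collections import Counter

# Experience E: labelled messages (1 = spam, 0 = not spam). Invented data.
train = [("win money now", 1), ("meeting at noon", 0),
         ("cheap money offer", 1), ("lunch at noon?", 0)]

spam_words, ham_words = Counter(), Counter()
for text, label in train:                 # "learning" = accumulating evidence
    (spam_words if label else ham_words).update(text.split())

# Task T: classify a new message by which class its words support more.
def predict(text):
    s = sum(spam_words[w] for w in text.split())
    h = sum(ham_words[w] for w in text.split())
    return 1 if s > h else 0

print(predict("free money"))    # 1: spam-like words dominate
print(predict("noon meeting"))  # 0
```

With more experience E (more labelled messages), the counts sharpen and performance P on task T improves, which is exactly the structure of the definition above.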
28. Machine Learning
Knowledge Discovery in Databases (KDD)
● It is the process of finding knowledge in data.
[http://www.rithme.eu/?m=home&p=kdprocess&lang=en]
29. Machine Learning
Exploring the data
● Frequent questions before starting the preprocessing stage
○ How many attributes? (Categorical / Numerical)
○ Are there missing values?
○ Are there scalar attributes (for numeric ones)?
○ Is there a label attribute? (Supervised / Unsupervised)
● Plot your data
○ A simple task that might reveal something you have not noticed yet.
31. Machine Learning
Preprocessing
● It is the first stage of data mining in which the data is prepared for mining.
● This stage presents the following tasks:
○ Data cleaning
■ Remove noise and inconsistent data.
○ Data integration
■ Multiple data sources may be combined.
○ Data selection
■ Select a set of data from a given dataset.
○ Data transformation
■ Transform the data into more appropriate forms to be processed.
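The four preprocessing tasks above can be sketched in a few lines of plain Python. The sensor records, the valid value range, and the choice of min-max normalization are all illustrative assumptions:

```python
# Invented readings from two "sources"; None marks a missing value.
source_a = [{"id": 1, "temp": 21.0}, {"id": 2, "temp": None},
            {"id": 3, "temp": 25.0}]
source_b = [{"id": 4, "temp": 1000.0}]      # an obvious outlier

# Data integration: combine the two sources.
data = source_a + source_b

# Data cleaning: drop missing values and out-of-range noise.
clean = [r for r in data if r["temp"] is not None and -50 <= r["temp"] <= 60]

# Data selection: keep only the attribute we will mine.
temps = [r["temp"] for r in clean]

# Data transformation: min-max normalization into [0, 1].
lo, hi = min(temps), max(temps)
normalized = [(t - lo) / (hi - lo) for t in temps]
print(normalized)   # [0.0, 1.0]
```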
32. Machine Learning
Processing
● It is the second stage focused on data mining.
● This stage aims to extract data patterns by applying intelligent methods.
● Create a model by applying ML methods (Classification / Regression)
○ Linear regression
○ Naïve Bayes
○ SVM
○ Neural networks
○ and others.
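For instance, the first model family on the list, linear regression, can be fitted in closed form by least squares; the toy data below is invented to lie roughly on y = 2x:

```python
# Least-squares fit of y = a*x + b on toy data (pure Python sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: covariance of x and y over the variance of x.
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx          # intercept passes through the means

print(round(a, 2), round(b, 2))   # 1.96 0.15
```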
33. Machine Learning
Analyze the results (Pattern evaluation)
● This stage identifies the truly interesting patterns representing knowledge
based on some interesting measures. [11]
● Finding a data pattern is an iterative process.
○ Try different models, metrics and analyze the results.
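A minimal sketch of that iterative loop, using invented scores and simple threshold "models" compared by accuracy (any real project would use richer models and metrics):

```python
# (score, label) pairs; labels are the ground truth. Invented data.
data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.5, 1)]

def accuracy(threshold):
    # A "model" here is just: predict 1 when score >= threshold.
    return sum((score >= threshold) == bool(label)
               for score, label in data) / len(data)

# Try several candidate models and keep the best one by the chosen metric.
best = max([0.3, 0.5, 0.7], key=accuracy)
print(best, accuracy(best))   # 0.5 1.0
```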
34. Machine Learning
How to get data?
● Data markets
○ UCI Repository
○ Datasets.co
○ Dublinked
● Competition and challenges
○ Kaggle
○ Data driven
○ Innocentive
35. Big Data Tools
Question
● Do classic ML methods fit Big Data problems?
Answer: No! Classical ML methods do not fit the Big Data requirements.
36. Big Data Tools
Lambda (λ) Architecture
● Process the data in Batch
and Real-Time.
● 3 Layers
○ Batch layer
○ Speed layer
○ Serving layer
[http://lambda-architecture.net/]
37. Big Data Tools
Hadoop
● It is an open source framework for writing and running distributed applications
that process large amounts of data. [12]
○ Parallel processing
● HDFS: Hadoop Distributed File System
○ It is based on GFS (Google File System)
● MapReduce
○ It splits the input dataset into independent chunks that are processed by
the map tasks in a completely parallel manner.
○ Data structure (I/O)
■ Key-Value
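The key-value map/shuffle/reduce flow can be mimicked in plain Python with a word count, the classic MapReduce example. This is a single-process sketch of the programming model, not actual Hadoop:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map: emit a (key, value) pair for every word in the input chunk.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: aggregate all values that were grouped under one key.
    return key, sum(values)

lines = ["big data big tools", "data stream"]

# Map phase (in Hadoop, run in parallel across the cluster).
mapped = chain.from_iterable(map_fn(line) for line in lines)

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase.
counts = dict(reduce_fn(k, v) for k, v in groups.items())
print(counts)   # {'big': 2, 'data': 2, 'tools': 1, 'stream': 1}
```

The same three stages, distributed over HDFS blocks and many machines, are what the JobTracker and TaskTrackers below coordinate.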
38. Big Data Tools
Hadoop
● Features of Hadoop
○ Accessible: it runs on large clusters of commodity machines.
○ Robust: it handles most failures gracefully.
○ Scalable: it scales linearly to handle more data by adding more nodes.
○ Simple: it allows you to quickly write efficient parallel code.
39. Big Data Tools
Hadoop: components
● Namenode
○ The master in HDFS's master/slave architecture; it directs the slave
DataNode daemons to perform the low-level I/O tasks. [12]
○ It keeps track of file locations (across nodes) and the health of the distributed system.
● DataNode
○ Each slave machine runs a DataNode daemon to perform the grunt work of the
distributed filesystem, that is, reading and writing HDFS blocks to actual files on
the local filesystem.
40. Big Data Tools
Hadoop: components
● Jobtracker
○ The liaison between your application and Hadoop: it submits your code to the cluster.
○ It determines the execution plan.
○ It holds an overall view of the system's execution.
● Tasktracker
○ It manages the execution of individual tasks on each slave node.
○ It holds a local view (in contrast to the Jobtracker's global view).
43. Big Data Tools
Spark
● It is a fast and general engine for large-scale data processing. [13]
● In-memory processing
● A stack of libraries
○ Spark SQL
○ Spark Streaming
○ MLlib
○ GraphX
[http://spark.apache.org/]
44. Big Data Tools
Spark: performance
● It runs up to 100x faster than Hadoop MapReduce in memory, or 10x faster
on disk. [13]
○ Example: logistic regression.
● Resilient Distributed Dataset (RDD)
○ It is an abstraction that enables developers to materialize any point in a processing
pipeline into memory across the cluster, meaning that future steps that want to
deal with the same data set need not recompute it or reload it from disk.
● It is well suited for highly iterative algorithms that require multiple passes over the data
set.
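The benefit of materializing an RDD can be imitated in plain Python; this is only an analogy for the caching idea, not the Spark API. Without caching, every pass of an iterative algorithm re-runs the whole pipeline; after materializing once, later passes reuse the result:

```python
# Count how often the "pipeline" step actually executes.
calls = {"n": 0}

def expensive_parse(x):
    calls["n"] += 1
    return x * x

raw = range(5)

# No cache: two passes over the data recompute the parse step both times.
sum(expensive_parse(x) for x in raw)
sum(expensive_parse(x) for x in raw)
print(calls["n"])              # 10

# "Cached": materialize once, then reuse across passes (RDD.cache() analogy).
calls["n"] = 0
cached = [expensive_parse(x) for x in raw]
sum(cached)
sum(cached)
print(calls["n"])              # 5
```

In Spark, `cache()`/`persist()` play this role across the cluster's memory, which is why highly iterative algorithms benefit so much.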
45. Big Data Tools
NoSQL
● Not only SQL / Non relational
○ It refers to a set of databases that provide a mechanism for the storage and
retrieval of data which does not follow the tabular relations used in
relational databases.
● Why NoSQL?
○ Simplicity of design.
○ Finer control over availability.
○ Simpler "horizontal" scaling to clusters of machines
46. Big Data Tools
NoSQL
● NoSQL data structure
○ Key-Value
○ Wide column
○ Graph
○ Document
● Data structure of a NoSQL is more flexible.
● Compromises consistency ("eventual consistency") in favor of:
○ Availability
○ Partition tolerance
○ Speed
47. Big Data Tools
NoSQL
● Drawbacks
○ Lack of standardized interfaces.
○ Low-level query languages.
○ Lost writes / Data loss
50. Big Data Tools
Big Data Ecosystem
[http://dataconomy.com/understanding-big-data-ecosystem/]
51. Big Data Tools
Think about Big Data Problem
● Task: how could big data tools fit to your problem?
○ Time: 20 minutes
○ We together discuss each problem.
■ Explain your problem in few lines.
■ Explain why your proposal might be a good solution for it.
○ Material:
■ Post it and pens
52. Practice
Programming Languages for Data science
● Python
○ It is a programming language with the following characteristics:
■ High-level;
■ General-purpose;
■ Interpreted;
■ Dynamically typed;
■ Expresses concepts in fewer lines of code (compared to, e.g., C++ and Java);
■ Significant indentation;
■ Cross-platform;
53. Practice
Programming Languages for Data science
● R
○ It is a software environment for statistical computing and graphics.
■ Features:
● Interpreted language;
● Widely used by statisticians and data miners;
● Several graphical front-ends available;
● Cross-platform;
54. Practice
Python packages for Machine Learning
● NumPy
○ Scientific computing:
■ A powerful N-dimensional array object.
■ Useful linear algebra, Fourier transform, and random number capabilities.
● Scikit-Learn
○ Machine learning:
■ Simple and efficient tools for data mining and data analysis.
● SciPy
○ Scientific computing and technical computing:
■ It depends on NumPy.
■ It provides many user-friendly and efficient numerical routines (e.g. numerical
integration)
55. Practice
Python packages for Machine Learning
● Pandas
○ It provides high-performance, easy-to-use data structures and data analysis tools.
■ Series
■ Dataframe
■ Panel
■ Panel4D / PanelND (Deprecated)
● Matplotlib
○ It is a 2D plotting library which produces publication quality figures in a variety of
hardcopy formats and interactive environments across platforms.
● Other packages exist in the ML / Big Data world for data science!
56. Practice
Anaconda
● It is a powerful collaboration and package management platform for open
source and private projects.
○ Features:
■ Python and R programming;
■ Large-scale data processing;
■ Predictive analytics;
■ Scientific computing;
■ Simplify package management and deployment;
■ Cloud service
57. Practice
Jupyter Notebook
● It is a web application that allows you to create and share documents that
contain live code, equations, visualizations and explanatory text.
○ Features:
■ Web-based;
■ Interactive data science and scientific computing;
■ Supports more than 40 programming languages;
■ Describes data analyses in a simple way;
■ Human-readable docs;
■ Big data integration
● Spark
58. Practice
Practice 01
● Introduction to Numpy
○ Array creation
○ Operations
○ Array transformations
○ Generate artificial data (Random sampling)
○ Statistical functions
● Find the introduction document on:
○ https://github.com/soutogustavo/data-science
■ Folder: Workshops / Cientec_2016_UFRN;
59. Practice
Practice 01
● Tasks:
○ Open Anaconda and start Jupyter Notebook
○ Create 2 numpy arrays (1x2 and 1x4) and perform:
■ Concatenate arrays;
■ Flatten the concatenated array;
■ Reshape the array to 2x3;
○ Create 2 numpy arrays with 1000 samples:
■ Apply statistical functions (e.g. mean, var, std.);
● Additional Information:
○ Time: 30 minutes
○ Material:
■ Anaconda, Python and Numpy;
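One possible solution sketch for these tasks, assuming NumPy is installed (the array contents and the random seed are arbitrary choices):

```python
import numpy as np

a = np.array([[1, 2]])             # a 1x2 array
b = np.array([[3, 4, 5, 6]])       # a 1x4 array

c = np.concatenate([a, b], axis=1) # concatenate along columns -> 1x6
flat = c.ravel()                   # flatten -> shape (6,)
reshaped = flat.reshape(2, 3)      # reshape to 2x3
print(reshaped.shape)              # (2, 3)

# Two arrays with 1000 samples each, plus some statistics.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.uniform(size=1000)
print(x.mean(), x.var(), x.std())
print(y.mean(), y.var(), y.std())
```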
60. Practice
Practice 02
● Introduction to Pandas
○ Create Series
○ Create Dataframe
○ Generate artificial data
○ Transform Series to Dataframe
○ Load a Dataset
○ Drop column
○ Insert column
○ Statistical functions
● Find the introduction document on:
○ https://github.com/soutogustavo/data-science
■ Folder: Workshops / Cientec_2016_UFRN;
61. Practice
Practice 02
● Tasks:
○ Create Series from random sampling:
■ Number of samples: 500;
■ Apply statistical functions;
■ Transform Data Series into DataFrame;
○ Create DataFrame from random sampling (5 attributes):
■ Number of samples: 100;
■ Drop one column;
■ Create a label attribute and insert it to the dataframe;
○ Load data
■ source: http://www.jwall.org/streams/sample-stream.csv
■ Apply statistical functions for each attribute;
■ Find out the number of (possible) labels and count them;
● Additional Information:
○ Time: 40 minutes
○ Material:
■ Anaconda, Python, Numpy and Pandas;
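A possible solution sketch, assuming NumPy and pandas are installed; the CSV-loading task is left out here because it needs network access, and the seed and sampled values are arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

s = pd.Series(rng.normal(size=500))            # Series from random sampling
print(s.mean(), s.var(), s.std())              # statistical functions
df_from_series = s.to_frame(name="value")      # Series -> DataFrame

df = pd.DataFrame(rng.normal(size=(100, 5)),   # 100 samples, 5 attributes
                  columns=list("abcde"))
df = df.drop(columns=["e"])                    # drop one column
df["label"] = rng.integers(0, 3, size=100)     # insert a label attribute
print(df.shape)                                # (100, 5)
print(df["label"].value_counts())              # count the (possible) labels
```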
62. Conclusions
Next steps
● Books
○ Marz, N. and Warren, J. Big Data: Principles and best practices of scalable
real-time data systems. Manning, 1st ed., 2015.
○ Lam, C. Hadoop in Action. Manning, 1st ed., 2011.
○ Karau, H. et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.
○ Lutz, M. Learning Python. O'Reilly, 5th ed., 2013.
64. References
1. IBM-01. What is big data? (2016). Retrieved from:
https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html.
2. Beal, V. Structured Data (2016). Retrieved from:
http://www.webopedia.com/TERM/S/structured_data.html.
3. Driscoll, Michael E. How much data is "Big Data"? (2010). Retrieved from:
https://www.quora.com/How-much-data-is-Big-Data.
4. Zikopoulos, P. et al. Harness the Power of Big Data: The IBM Big Data
Platform. McGraw Hill, 2013. ISBN: 978-0-07-180817-0.
5. Wikipedia: Free Encyclopedia. Internet of Things (2016). Retrieved from:
https://en.wikipedia.org/wiki/Internet_of_things.
65. References
6. Wikipedia: Free Encyclopedia. Business Intelligence (2016). Retrieved from:
https://en.wikipedia.org/wiki/Business_intelligence.
7. Tutorials Points. Data Warehousing - Concepts (2016). Retrieved from:
http://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm.
8. Big Data University. Data Scientist vs Data Engineer, What’s the
difference? (2016) Retrieved from:
https://bigdatauniversity.com/blog/data-scientist-vs-data-engineer/.
9. Simon, P. Too Big to Ignore: The Business Case for Big Data. Wiley,
March 18, 2013, p. 89. ISBN: 978-1-118-63817-0.
10. Mitchell, T. Machine Learning. McGraw Hill, 1997. ISBN: 0070428077.
66. References
11. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. MK, 2nd
ed., 2006. ISBN-10: 1-55860-901-6.
12. Lam, C. Hadoop in Action. Manning, 2011. ISBN: 9781935182191.
13. Apache Spark. Spark (2016). Retrieved from: http://spark.apache.org/