Big Data & Social
Analytics
Gustavo Souto, M.Sc.
Summary
● Part One: Theory about Big Data and Analytics
○ About Data
○ What’s Big Data?
○ Machine Learning
○ Big Data tools
● Part Two: Practice
○ Languages for Data Science
○ Main python packages for Machine Learning
○ Let's get some practice
● Part Three: Conclusions
○ Let’s recap!
○ Next steps
○ References
About me
Gustavo Souto
I am a Ph.D. student at the Federal University of Rio Grande
do Norte (UFRN). I started my Ph.D. at Technische
Universität Dortmund (TU Dortmund) in Germany.
I also hold a Master's degree in Computer Engineering
from UFRN.
Topics of interest: Machine learning, Big Data, Data
stream, anomaly detection
About Data
Understanding the data
● How much data do we create every day?
○ About 2.5 quintillion (2.5 × 10^18) bytes [1].
● How about in 1992?
○ 100 GB per day.
● Let’s check out the table of data size to better understand the data.
About Data
Name Size Example
Byte 8 bits A single character
Kilobyte 1,000 bytes A compressed document image page (50 KB)
Megabyte 1,000 kilobytes A digital book (5 MB)
Gigabyte 1,000 megabytes A recorded symphony in HiFi
Terabyte 1,000 gigabytes An automated tape robot
Petabyte 1,000 terabytes All academic libraries in the EU (2 PB)
Exabyte 1,000 petabytes All words ever spoken by humanity (5 EB)
Zettabyte 1,000 exabytes
Yottabyte 1,000 zettabytes Current storage capacity of the Internet
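As a rough Python sketch (decimal, base-1000 prefixes, with unit abbreviations taken from the table above), a byte count such as the daily data volume can be rendered in human-readable form:

```python
# Decimal (SI) byte units, each 1000x the previous, as in the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_size(n_bytes):
    """Render a byte count using the largest unit that keeps the value >= 1."""
    value, unit = float(n_bytes), UNITS[0]
    for u in UNITS[1:]:
        if value < 1000:
            break
        value, unit = value / 1000, u
    return f"{value:g} {unit}"

# The 2.5 quintillion bytes created per day land in the exabyte range.
print(human_size(2.5e18))
```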
About Data
Where do we find data?
● Text:
○ Documents and reports.
● Databases:
○ MySQL, PostgreSQL, Oracle etc.
● Geographic:
○ GPS and Maps.
● Social Media:
○ Facebook, Twitter, Instagram etc.
● Archives:
○ JSON, CSV, XML etc.
● APIs:
● Images and videos
About Data
About internal structure of data
● Structured data
○ The data follows a well-defined structure, that is, displayed in titled columns and
rows.
○ Examples: Datasets from MySQL, PostgreSQL, Oracle etc.
● Semistructured data
○ It is a type of structured data, but lacks the strict data model [2].
○ Examples: Emails, XML, JSON.
● Unstructured data
○ The data does not follow a structure.
○ Examples: Free-form text, comment fields, tweets.
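The distinction can be illustrated with the standard library: CSV rows all share one column schema, while JSON records are self-describing and need not share fields. (A toy example; the field names are made up.)

```python
import csv
import io
import json

# Structured: every row follows the same column schema.
csv_text = "name,age\nAlice,30\nBob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, but records need not share fields.
json_text = '[{"name": "Alice", "age": 30}, {"name": "Bob", "email": "b@x.org"}]'
records = json.loads(json_text)

print(rows[0]["name"], records[1].get("age"))  # Bob's record has no "age"
```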
What’s Big Data
The 4 V’s
● Volume
○ A massive volume of both structured and unstructured data.
● How much data is considered ‘Big Data’?
○ Terabytes, Petabytes, and so on.
○ Driscoll created a simple table that defines borders [3].
What’s Big Data
The 4 V’s
● Variety
○ Try to capture all of the data that pertains to our decision-making process [4].
● Data complements
○ Analyze different sources to drive decisions.
○ Most of the data out there is semistructured or unstructured.
■ Example: Customer call center (voice classification + customer’s record data
+ transaction history).
“This is the third outage I’ve had in one week!”
What’s Big Data
The 4 V’s
● Velocity
○ The rate at which data arrives at the enterprise and is processed or well
understood [4].
● “How long does it take you to do something about it or know it has even
arrived?” [4]
○ One pass processing.
■ Example: An enterprise facing a network security problem.
○ Data stream
■ Example: Netflix, Youtube, Sensors.
○ Velocity is one of the most overlooked areas in Big Data.
What’s Big Data
The 4 V’s
● Veracity
○ It refers to the quality of data, or trustworthiness [4].
● Data transformations
○ Remove noise.
● Big spam
○ Untrustworthy information (noise).
○ Example: Some Tweets.
What’s Big Data
Companies and Data collection
● Several companies collect data
What’s Big Data
Cases
● Smart city
What’s Big Data
Cases
● Network security
What’s Big Data
Cases
● Social media
What’s Big Data
Cases
● Logistics
What’s Big Data
How about ethics?
● You are responsible for the data!
○ Be careful when you deal with them.
● Risks
○ (Think first!) Do the results bring risk to anyone?
■ Understand the risks of your decisions.
● Personally Identifiable Information: PII
○ Information that identifies one person from another.
○ The data must be anonymized.
○ Red flags: address, telephone number, geolocation, codes, slang.
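As one illustrative precaution, direct identifiers can be replaced by salted one-way hashes before analysis. This is a sketch, not a complete anonymization scheme: hashing a small value space such as phone numbers can still be brute-forced, and the record and field names below are made up.

```python
import hashlib

def pseudonymize(value, salt="workshop-demo-salt"):  # salt is a made-up example
    """Replace a direct identifier with a short, salted one-way hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]

record = {"name": "Alice Souza", "phone": "+55 84 9999-0000", "city": "Natal"}
# Pseudonymize only the PII fields; keep coarse attributes for analysis.
safe = {k: (pseudonymize(v) if k in {"name", "phone"} else v)
        for k, v in record.items()}
print(safe)
```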
What’s Big Data
Everything is connected
● Internet of Things (IoT)
○ It is an internetworking of physical devices,
vehicles, buildings, and other items embedded
with software, sensors, actuators, and network
connectivity that enables these objects to
collect and exchange data [5].
What’s Big Data
Business Intelligence (BI)
● Definition
○ It is a set of techniques and tools for the acquisition and transformation of raw data
into meaningful and useful information for business analysis purposes [6].
● Common functionalities
○ Reporting
○ Online analytical processing
○ Analytics
○ Data Mining
○ Process Mining
○ Complex Event Processing (CEP)
○ etc.
What’s Big Data
Business Intelligence (BI)
● Support a wide range of business decisions.
● BI Framework
○ Data Warehousing (DW)
■ It is constructed by integrating data from multiple heterogeneous sources that support
analytical reporting, structured and/or ad hoc queries, and decision making [7].
■ It is considered a core component of B.I.
○ ETL’s (Extract - Transform - Load)
○ Analysis tools
What’s Big Data
Business Intelligence (BI)
● Online Analytical Processing (OLAP)
○ An approach that answers multi-dimensional analytical queries swiftly.
○ Encompasses: relational databases, report writing, and data mining.
○ Example applications: business reporting for sales, marketing, and management.
● Online Transactional Processing (OLTP)
○ A class of information systems that facilitate and manage transaction-oriented
applications, that is, it processes transactions rather than BI or reporting.
○ Much less complex queries (compared to OLAP), in a large volume.
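The contrast can be sketched with Python's built-in sqlite3 (a toy schema; the table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# OLTP-style workload: many small, simple transactions (single-row inserts).
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("north", "a", 10.0), ("north", "b", 5.0), ("south", "a", 7.5)])
con.commit()

# OLAP-style workload: an analytical aggregate over the whole table.
report = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(report)  # [('north', 15.0), ('south', 7.5)]
```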
What’s Big Data
Business Intelligence (BI)
● BI Framework
What’s Big Data
Business Intelligence (BI)
● Drawbacks
○ Tries to build perfect statistical models of data that may have already changed.
○ Describes the past; it does not predict the future.
○ Assumes that the data state is constant.
○ Poor support for video, audio, logs, and other unstructured data.
What’s Big Data
Data Science Roles
● Data Scientist
○ They are experienced data professionals in their organization who can query and process
data, provide reports, summarize and visualize data [8].
● Data Engineer
○ They are the data professionals who prepare the “big data” infrastructure to be analyzed by
Data Scientists, that is, design, build, integrate data from various resources, and manage big
data [8].
● Business Intelligence Developers
○ They are data experts that interact more closely with internal stakeholders to understand the
reporting needs, and then to collect requirements, design, and build BI and reporting solutions
for the company [8].
What’s Big Data
Think about Big Data Problem
● Task: come up with a problem (Big Data)
○ Time: 20 minutes
○ We together discuss each problem.
■ Explain your problem in a few lines.
■ Explain why you think Big Data might be a good solution for it.
○ Material:
■ Post it and pens
Machine Learning
What’s Machine Learning?
● It gives computers the ability to learn without being explicitly programmed. [9]
● A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E.
[10]
Machine Learning
Data Model
● It organizes the data elements and standardizes how these elements relate to
one another.
○ Algorithm builds a model from sample inputs.
● Applications:
○ Spam filtering, Search engines, computer vision, and others.
Machine Learning
Knowledge Discovery in Databases (KDD)
● It is the process of finding knowledge in data.
[http://www.rithme.eu/?m=home&p=kdprocess&lang=en]
Machine Learning
Exploring the data
● Frequent questions before starting the preprocessing stage
○ How many attributes? (Categorical / Numerical)
○ Are there missing values?
○ Are there scalar attributes (for numeric ones)?
○ Is there a label attribute? (Supervised / Unsupervised)
● Plot your data
○ A simple task that might show something you have not realized yet.
Machine Learning
Exploring the data
● Types of plot
○ Scatter
○ Histogram
○ Boxplot
○ Bars
○ and others
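Before reaching for a plotting library, even a dependency-free text histogram can give a first look at a distribution (a minimal illustration with made-up data):

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
counts = Counter(data)

# A quick text histogram: one bar of '#' per distinct value.
for value in sorted(counts):
    print(f"{value}: {'#' * counts[value]}")
```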
Machine Learning
Preprocessing
● It is the first stage of data mining in which the data is prepared for mining.
● This stage presents the following tasks:
○ Data cleaning
■ Remove noise and inconsistent data.
○ Data integration
■ Multiple data sources may be combined.
○ Data selection
■ Select a set of data from a given dataset.
○ Data transformation
■ Transform the data into more appropriate forms to be processed.
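Two of these tasks, cleaning and transformation, can be sketched in a few lines of plain Python (toy records; the attribute name is made up):

```python
# Minimal sketch: drop records with missing values, then min-max scale.
raw = [{"temp": 21.0}, {"temp": None}, {"temp": 25.0}, {"temp": 29.0}]

clean = [r for r in raw if r["temp"] is not None]   # data cleaning
values = [r["temp"] for r in clean]
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]     # data transformation
print(scaled)  # [0.0, 0.5, 1.0]
```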
Machine Learning
Processing
● It is the second stage focused on data mining.
● This stage aims to extract data patterns by applying intelligent methods.
● Create a model by applying ML methods (Classification / Regression)
○ Linear regression
○ Naïve Bayes
○ SVM
○ Neural networks
○ and others.
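As a minimal example of the first method on the list, simple linear regression with one feature has a closed form, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x):

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature, via the closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Points generated from y = 2x + 1, so the fit should recover those values.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```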
Machine Learning
Analyze the results (Pattern evaluation)
● This stage identifies the truly interesting patterns representing knowledge
based on some interesting measures. [11]
● Finding a data pattern is an iterative process.
○ Try different models, metrics and analyze the results.
Machine Learning
How to get data?
● Data markets
○ UCI Repository
○ Datasets.co
○ Dublinked
● Competition and challenges
○ Kaggle
○ DrivenData
○ Innocentive
Big Data Tools
Question
● Do the classic ML methods fit to Big Data problems?
Answer: No! Classical ML methods do not fit Big Data requirements.
Big Data Tools
Lambda Architecture
● Process the data in Batch
and Real-Time.
● 3 Layers
○ Batch layer
○ Speed layer
○ Serving layer
[http://lambda-architecture.net/]
Big Data Tools
Hadoop
● It is an open source framework for writing and running distributed applications
that process large amounts of data. [12]
○ Parallel processing
● HDFS: Hadoop Distributed File System
○ It is based on GFS (Google File System)
● MapReduce
○ It splits the input data-set into independent chunks which are processed by the
map tasks in a completely parallel manner.
○ Data structure (I/O)
■ Key-Value
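The key-value flow of MapReduce can be sketched in plain Python (a single-process simulation, not the Hadoop API): the map phase emits (word, 1) pairs, and the reduce phase sums the values per key.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit one (key, value) pair per word.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key, then sum each group's values.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

docs = ["big data big tools", "big data"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(counts)  # {'big': 3, 'data': 2, 'tools': 1}
```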
Big Data Tools
Hadoop
● Features of Hadoop
○ Accessible: It runs on large clusters of commodity machines.
○ Robust: It handles most failures.
○ Scalable: It scales linearly to handle larger data by adding more nodes.
○ Simple: It allows you to quickly write efficient parallel code.
Big Data Tools
Hadoop: components
● Namenode
○ It applies the master/slave architecture. It is the master of HDFS, directing the
slave DataNode daemons to perform the low-level I/O tasks. [12]
○ It keeps track of file locations (among nodes) and the health of the distributed system.
● DataNode
○ Each slave machine runs a DataNode daemon to perform the grunt work of the
distributed filesystem, that is, reading and writing HDFS blocks to actual files on
the local filesystem.
Big Data Tools
Hadoop: components
● Jobtracker
○ It is the liaison between your application and Hadoop; it submits your code to the cluster.
○ It determines the execution plan.
○ It holds an overall view of the system execution.
● Tasktracker
○ It manages the execution of individual tasks on each slave node.
○ It holds a local view (compared to Jobtracker).
Big Data Tools
Spark
● It is a fast and general engine for large-scale data processing. [13]
● In-memory processing
● A stack of libraries
○ Spark SQL
○ Spark Streaming
○ MLlib
○ GraphX
[http://spark.apache.org/]
Big Data Tools
Spark: performance
● It runs up to 100x faster than Hadoop MapReduce in memory, or 10x faster
on disk. [13]
○ Example: Logistic regression (see image below).
● Resilient Distributed Dataset (RDD)
○ It is an abstraction that enables developers to materialize any point in a processing
pipeline into memory across the cluster, meaning that future steps that want to
deal with the same data set need not recompute it or reload it from disk.
● It is well suited for highly iterative algorithms that require multiple passes over the data
set.
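The payoff of materialization can be imitated in plain Python (an analogy only, not the Spark API): without caching, every pass over a lazy pipeline recomputes it from scratch.

```python
calls = {"n": 0}

def expensive_transform(x):
    calls["n"] += 1          # count how often the pipeline is recomputed
    return x * x

data = range(5)

# Without materialization: a generator is rebuilt (recomputed) on every pass.
for _ in range(3):
    total = sum(expensive_transform(x) for x in data)
recomputed = calls["n"]      # 3 passes x 5 elements = 15 calls

# With materialization (RDD.cache() analogy): compute once, reuse the result.
calls["n"] = 0
cached = [expensive_transform(x) for x in data]
for _ in range(3):
    total = sum(cached)
print(recomputed, calls["n"])  # 15 5
```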
Big Data Tools
NoSQL
● Not only SQL / Non relational
○ It refers to a set of databases whose mechanism for storage and retrieval of
data does not follow the tabular relations used in common relational
databases.
● Why NoSQL?
○ Simplicity of design.
○ Finer control over availability.
○ Simpler "horizontal" scaling to clusters of machines
Big Data Tools
NoSQL
● NoSQL data structure
○ Key-Value
○ Wide column
○ Graph
○ Document
● Data structure of a NoSQL is more flexible.
● Compromise consistency in favor of: (Apply "eventual consistency")
○ Availability
○ Partition tolerance
○ Speed
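A key-value store, the simplest of these structures, can be sketched as a thin wrapper over a dict (a toy, single-node illustration with no persistence or replication):

```python
class KVStore:
    """A toy in-memory key-value store with the basic NoSQL operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value      # values can be any shape (flexible schema)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:1", {"name": "Alice", "tags": ["admin"]})
print(store.get("user:1")["name"])
```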
Big Data Tools
NoSQL
● Drawbacks
○ Lack of standardized interfaces.
○ Low-level query languages.
○ Lost writes / Data loss
Big Data Tools
NoSQL: Databases
● Column
● Document
● Key-Value
● Graph
Big Data Tools
Lambda Architecture
Big Data Tools
Big Data Ecosystem
[http://dataconomy.com/understanding-big-data-ecosystem/]
Big Data Tools
Think about Big Data Problem
● Task: how could big data tools fit to your problem?
○ Time: 20 minutes
○ We together discuss each problem.
■ Explain your problem in a few lines.
■ Explain why your proposal might be a good solution for it.
○ Material:
■ Post it and pens
Practice
Programming Languages for Data science
● Python
○ It is a programming language with the following characteristics:
■ High-level;
■ General-purpose;
■ Interpreted language;
■ Dynamically typed;
■ Expresses concepts in fewer lines of code (compared to, e.g., C++ and Java);
■ Significant indentation;
■ Cross-platform;
Practice
Programming Languages for Data science
● R
○ It is a software environment for statistical computing and graphics.
■ Features:
● Interpreted language;
● Widely used by statisticians and data miners;
● Several graphical-frontends available;
● Cross-platform;
Practice
Python packages for Machine Learning
● NumPy
○ Scientific computing:
■ A powerful N-dimensional array object.
■ Useful linear algebra, Fourier transform, and random number capabilities.
● Scikit-Learn
○ Machine learning:
■ Simple and efficient tools for data mining and data analysis.
● Scipy
○ Scientific computing and technical computing:
■ It depends on NumPy.
■ It provides many user-friendly and efficient numerical routines (e.g. numerical
integration)
Practice
Python packages for Machine Learning
● Pandas
○ It provides high-performance, easy-to-use data structures and data analysis tools.
■ Series
■ Dataframe
■ Panel
■ Panel4D / PanelND (Deprecated)
● Matplotlib
○ It is a 2D plotting library which produces publication quality figures in a variety of
hardcopy formats and interactive environments across platforms.
● There are other packages in the ML / Big Data world for Data Science!
Practice
Anaconda
● It is a distribution of Python and R with package management for open source and
private projects.
○ Features:
■ Python and R programming;
■ Large-scale data processing;
■ Predictive analytics;
■ Scientific computing;
■ Simplify package management and deployment;
■ Cloud service
Practice
Jupyter Notebook
● It is a web application that allows you to create and share documents that
contain live code, equations, visualizations and explanatory text.
○ Features:
■ Web-based;
■ Interactive data science and scientific computing;
■ Supports more than 40 programming languages;
■ Describe data analysis in a simple way;
■ Human-readable docs;
■ Big data integration
● Spark
Practice
Practice 01
● Introduction to Numpy
○ Array creation
○ Operations
○ Array transformations
○ Generate artificial data (Random sampling)
○ Statistical functions
● Find the introduction document on:
○ https://github.com/soutogustavo/data-science
■ Folder: Workshops / Cientec_2016_UFRN;
Practice
Practice 01
● Tasks:
○ Open Anaconda and start Jupyter Notebook
○ Create 2 numpy arrays (1x2 and 1x4) and perform:
■ Concatenate arrays;
■ Flatten the concatenated array;
■ Reshape the array to 2x3;
○ Create 2 numpy arrays with 1000 samples:
■ Apply statistical functions (e.g. mean, var, std.);
● Additional Information:
○ Time: 30 minutes
○ Material:
■ Anaconda, Python and Numpy;
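One possible solution sketch for the array tasks above (assumes NumPy is available, e.g. via Anaconda):

```python
import numpy as np

a, b = np.array([[1, 2]]), np.array([[3, 4, 5, 6]])  # 1x2 and 1x4 arrays

joined = np.concatenate([a, b], axis=1)   # concatenate along columns -> 1x6
flat = joined.flatten()                   # flatten -> shape (6,)
reshaped = flat.reshape(2, 3)             # reshape -> 2x3

# 1000 random samples, then the statistical functions from the task.
samples = np.random.randn(1000)
stats = samples.mean(), samples.var(), samples.std()
print(reshaped.shape, flat.shape)
```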
Practice
Practice 02
● Introduction to Pandas
○ Create Series
○ Create Dataframe
○ Generate artificial data
○ Transform Series to Dataframe
○ Load a Dataset
○ Drop column
○ Insert column
○ Statistical functions
● Find the introduction document on:
○ https://github.com/soutogustavo/data-science
■ Folder: Workshops / Cientec_2016_UFRN;
Practice
Practice 02
● Tasks:
○ Create Series from random sampling:
■ Number of samples: 500;
■ Apply statistical functions;
■ Transform Data Series into DataFrame;
○ Create DataFrame from random sampling (5 attributes):
■ Number of samples: 100;
■ Drop one column;
■ Create a label attribute and insert it to the dataframe;
○ Load data
■ source: http://www.jwall.org/streams/sample-stream.csv
■ Apply statistical functions for each attribute;
■ Find out the number of (possible) labels and count them;
● Additional Information:
○ Time: 40 minutes
○ Material:
■ Anaconda, Python,
Numpy and Pandas;
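One possible solution sketch for the first two task groups (assumes pandas and NumPy; the label rule and column names are arbitrary choices):

```python
import numpy as np
import pandas as pd

# Series from random sampling (500 samples), then promoted to a DataFrame.
s = pd.Series(np.random.randn(500))
stats = s.mean(), s.std()
df_from_series = s.to_frame(name="x")

# DataFrame with 5 attributes (100 samples): drop one column, insert a label.
df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))
df = df.drop(columns=["e"])
df["label"] = np.where(df["a"] > 0, "pos", "neg")
print(df.shape)
```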
Conclusions
Next steps
● Books
○ Marz, N. and Warren, J. Big Data: Principles and Best Practices of Scalable
Real-Time Data Systems. Manning, 1st ed., 2015.
○ Lam, C. Hadoop in Action. Manning, 1st ed., 2011.
○ Karau, H. et. al. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly, 2015.
○ Lutz, M. Learning Python. O’Reilly, 5th ed., 2013.
Thank you!
Contact
Email: ghsouto@gmail.com
Github: soutogustavo
References
1. IBM-01. What is big data? (2016). Retrieved from:
https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html.
2. Beal, V.. Structured Data (2016). Retrieved from:
http://www.webopedia.com/TERM/S/structured_data.html.
3. Driscoll, Michael E. How much data is "Big Data"? (2010). Retrieved from:
https://www.quora.com/How-much-data-is-Big-Data.
4. Zikopoulos et al.. Harness the Power of Big Data: The IBM Big Data
Platform. 2013. McGraw Hill. ISBN: 978-0-07-180817-0.
5. Wikipedia: Free Encyclopedia. Internet of Things (2016). Retrieved from:
https://en.wikipedia.org/wiki/Internet_of_things.
References
6. Wikipedia: Free Encyclopedia. Business Intelligence (2016). Retrieved from:
https://en.wikipedia.org/wiki/Business_intelligence.
7. Tutorials Points. Data Warehousing - Concepts (2016). Retrieved from:
http://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm.
8. Big Data University. Data Scientist vs Data Engineer, What’s the
difference? (2016) Retrieved from:
https://bigdatauniversity.com/blog/data-scientist-vs-data-engineer/.
9. Simon, P.. Too Big to Ignore: The Business Case for Big Data. Wiley. p.
89. March 18, 2013. ISBN 978-1-118-63817-0.
10. Mitchell, T.. Machine Learning. McGraw Hill, 1997. ISBN: 0070428077.
References
11. Han, J. and Kamber, M.. Data Mining: Concepts and Techniques. MK, 2nd
ed., 2006. ISBN-10: 1-55860-901-6.
12. Lam, C. Hadoop in Action. Manning, 2011. ISBN: 9781935182191.
13. Apache Spark. Spark (2016). Retrieved from: http://spark.apache.org/

More Related Content

What's hot

Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageData mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageq-Maxim
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)Pratik Tambekar
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsIJMER
 
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldBig Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldDez Blanchfield
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 
What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?Seval Çapraz
 
Henning agt talk-caise-semnet
Henning agt   talk-caise-semnetHenning agt   talk-caise-semnet
Henning agt talk-caise-semnetcaise2013vlc
 
How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?Noam Cohen
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueMehmet Beyaz
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousingSunny Gandhi
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining Sushil Kulkarni
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Marko Grobelnik
 

What's hot (20)

Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageData mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid language
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Introduction to DataMining
Introduction to DataMiningIntroduction to DataMining
Introduction to DataMining
 
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez BlanchfieldBig Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
Big Data Presentation - Data Center Dynamics Sydney 2014 - Dez Blanchfield
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?
 
Henning agt talk-caise-semnet
Henning agt   talk-caise-semnetHenning agt   talk-caise-semnet
Henning agt talk-caise-semnet
 
Data mining and knowledge Discovery
Data mining and knowledge DiscoveryData mining and knowledge Discovery
Data mining and knowledge Discovery
 
How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?How Data Science Can Grow Your Business?
How Data Science Can Grow Your Business?
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data mining
Data miningData mining
Data mining
 
TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining Technique
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
 

Similar to Big Data & Social Analytics presentation

Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
General introduction to AI ML DL DS
General introduction to AI ML DL DSGeneral introduction to AI ML DL DS
General introduction to AI ML DL DSRoopesh Kohad
 
Data analytics career path
Data analytics career pathData analytics career path
Data analytics career pathRubikal
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining KindergartenAlexey Zinoviev
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Big Data overview
Big Data overviewBig Data overview
Big Data overviewalexisroos
 
Data Analytics.03. Data processing
Data Analytics.03. Data processingData Analytics.03. Data processing
Data Analytics.03. Data processingAlex Rayón Jerez
 
How to become a data scientist
How to become a data scientist How to become a data scientist
How to become a data scientist Manjunath Sindagi
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practiceLars Albertsson
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)Manjunath Sindagi
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 

Similar to Big Data & Social Analytics presentation (20)

Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
General introduction to AI ML DL DS
General introduction to AI ML DL DSGeneral introduction to AI ML DL DS
General introduction to AI ML DL DS
 
L15.pptx
L15.pptxL15.pptx
L15.pptx
 
Data analytics career path
Data analytics career pathData analytics career path
Data analytics career path
 
Data Analytics Career Paths
Data Analytics Career PathsData Analytics Career Paths
Data Analytics Career Paths
 
Data preprocessing.pdf
Data preprocessing.pdfData preprocessing.pdf
Data preprocessing.pdf
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining Kindergarten
 
Data science guide
Data science guideData science guide
Data science guide
 
Data science
Data scienceData science
Data science
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Data Analytics.03. Data processing
Data Analytics.03. Data processingData Analytics.03. Data processing
Data Analytics.03. Data processing
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
How to become a data scientist
How to become a data scientist How to become a data scientist
How to become a data scientist
 
Unit - I FDS.pdf
Unit - I FDS.pdfUnit - I FDS.pdf
Unit - I FDS.pdf
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)Introduction to machine learning and applications (1)
Introduction to machine learning and applications (1)
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 

Big Data & Social Analytics presentation

  • 1. Big Data & Social Analytics Gustavo Souto, M.Sc.
  • 2. Summary ● Part One: Theory about Big Data and Analytics ○ About Data ○ What’s Big Data? ○ Machine Learning ○ Big Data tools ● Part Two: Practice ○ Languages for Data Science ○ Main python packages for Machine Learning ○ Let's get some practice ● Part Three: Conclusions ○ Let’s recap! ○ Next steps ○ References
  • 3. About me Gustavo Souto I am Ph.D. student at Federal University of Rio Grande do Norte (UFRN). I have started the Ph.D. degree at Technische Universität Dortmund (TU-Dortmund) in Germany. I also hold a Master's degree in Computer Engineering from UFRN. Topics of interest: Machine learning, Big Data, Data stream, anomaly detection
  • 4. About Data Understanding the data ● How much data do we create every day? ○ It is 2.5 quintillion bytes of data [1]. ● How about in 1992? ○ 100GB day. ● Let’s check out the table of data size to better understand the data.
  • 5. About Data Name Size Example Byte 8 bits A single character Kilobyte 1000 Bytes A compressed doc. Img. page. (50Kb) Megabyte 1000 Kilobyte Digital book (5Mb) Gigabyte 1000 Megabyte One recorded symphony in HiFi. Terabyte 1000 Gigabyte Automated tape robot. Petabyte 1000 Terabyte All academic libraries in EU (2PB) Exabyte 1000 Petabyte All words said by all humanity so far. (5EX) Zettabyte 1000 Exabyte Yottabyte 1000 Zettabyte Current storage capacity of Internet.
  • 6. About Data Where do we find data? ● Text: ○ Documents and reports. ● Databases: ○ MySQL, PostgreSQL, Oracle etc. ● Geographic: ○ GPS and Maps. ● Social Media: ○ Facebook, Twitter, Instagram etc. ● Archives: ○ JSON, CSV, XML etc. ● APIs: ● Images and videos
  • 7. About Data About internal structure of data ● Structured data ○ The data follows a well-defined structure, that is, displayed in titled columns and rows. ○ Examples: Datasets from MySQL, PostgreSQL, Oracles etc. ● Semistructured data ○ It is a type of structured data, but lacks the strict data model [2]. ○ Examples: Emails, XML, JSON. ● Unstructured data ○ The data does not follow a structure. ○ Examples: Freedom text, comments fields, tweets.
  • 8. What’s Big Data The 4 V’s ● Volume ○ A massive volume of both structured and unstructured data. ● How much data is considered ‘Big Data’? ○ Terabytes, Petabytes, and so on. ○ Driscoll proposed a simple table that draws the boundaries [3].
  • 9. What’s Big Data The 4 V’s ● Variety ○ Try to capture all of the data that pertains to our decision-making process [4]. ● Data complements ○ Analyze different sources to drive decisions. ○ Most of the data out there is semistructured or unstructured. ■ Example: Customer call center (voice classification + customer’s record data + transaction history). “This is the third outage I’ve had in one week!”
  • 10. What’s Big Data The 4 V’s ● Velocity ○ The rate at which data arrives at the enterprise and is processed or well understood [4]. ● “How long does it take you to do something about it or know it has even arrived?” [4] ○ One-pass processing. ■ Example: An enterprise facing a network security problem. ○ Data stream ■ Example: Netflix, Youtube, Sensors. ○ Velocity is one of the most overlooked areas in Big Data.
  • 11. What’s Big Data The 4 V’s ● Veracity ○ It refers to the quality of data, or trustworthiness [4]. ● Data transformations ○ Remove noise. ● Big spam ○ Untrustworthy information (noise). ○ Example: Some Tweets.
  • 12. What’s Big Data Companies and Data collection ● Several companies collect data
  • 14. What’s Big Data Cases ● Network security
  • 17. What’s Big Data How about ethics? ● You are responsible for the data! ○ Be careful when you deal with it. ● Risks ○ (Think first!) Do the results bring risk to anyone? ■ Understand the risks of your decisions. ● Personally Identifiable Information (PII) ○ Information that identifies one person from another. ○ The data must be anonymized. ○ Red flags: address, telephone number, geolocation, codes, slang.
  • 18. What’s Big Data Everything is connected ● Internet of Things (IoT) ○ It is the internetworking of physical devices, vehicles, buildings, and other items (embedded with electronics, software, sensors, actuators, and network connectivity) that enables these objects to collect and exchange data [5].
  • 19. What’s Big Data Business Intelligence (BI) ● Definition ○ It is a set of techniques and tools for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes [6]. ● Common functionalities ○ Reporting ○ Online analytical processing ○ Analytics ○ Data Mining ○ Process Mining ○ Complex Event Processing (CEP) ○ etc.
  • 20. What’s Big Data Business Intelligence (BI) ● Support a wide range of business decisions. ● BI Framework ○ Data Warehousing (DW) ■ It is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making [7]. ■ It is considered a core component of B.I. ○ ETL’s (Extract - Transform - Load) ○ Analysis tools
  • 21. What’s Big Data Business Intelligence (BI) ● Online Analytical Processing (OLAP) ○ An approach which answers multi-dimensional analytical queries swiftly. ○ Encompasses: relational databases, report writing, and data mining. ○ Example applications: business reporting for sales, marketing, management reporting. ● Online Transactional Processing (OLTP) ○ A class of information systems that facilitate and manage transaction-oriented applications, that is, it processes transactions rather than BI or reporting. ○ Much simpler queries than OLAP, but in a far larger volume.
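The OLTP/OLAP contrast can be sketched with the standard-library sqlite3 module (the table and values below are invented for illustration): OLTP issues many small writes, while an OLAP-style query aggregates across many rows at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# OLTP-style workload: a high volume of small, simple transactions (inserts).
rows = [("north", 10.0), ("north", 20.0), ("south", 5.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()

# OLAP-style query: an analytical aggregation over the whole table.
cur = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
)
print(cur.fetchall())  # -> [('north', 30.0), ('south', 5.0)]
```

Real OLAP engines add multi-dimensional cubes and pre-aggregation on top; this only shows the shape of the two workloads.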
  • 22. What’s Big Data Business Intelligence (BI) ● BI Framework
  • 23. What’s Big Data Business Intelligence (BI) ● Drawbacks ○ Tries to build perfect statistical models of data that may have already changed. ○ Describes the past; makes no predictions about the future. ○ Assumes that the state of the data is constant. ○ Poor support for video, audio, logs and unstructured data.
  • 24. What’s Big Data Data Science Roles ● Data Scientist ○ They are experienced data professionals in their organization who can query and process data, provide reports, summarize and visualize data [8]. ● Data Engineer ○ They are the data professionals who prepare the “big data” infrastructure to be analyzed by Data Scientists, that is, design, build, integrate data from various resources, and manage big data [8]. ● Business Intelligence Developers ○ They are data experts that interact more closely with internal stakeholders to understand the reporting needs, and then to collect requirements, design, and build BI and reporting solutions for the company [8].
  • 25. What’s Big Data Think about a Big Data Problem ● Task: come up with a (Big Data) problem ○ Time: 20 minutes ○ We will discuss each problem together. ■ Explain your problem in a few lines. ■ Explain why you think Big Data might be a good solution for it. ○ Material: ■ Post-its and pens
  • 26. Machine Learning What’s Machine Learning? ● It gives computers the ability to learn without being explicitly programmed. [9] ● A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. [10]
  • 27. Machine Learning Data Model ● It organizes the data elements and standardizes how these elements relate to one another. ○ Algorithm builds a model from sample inputs. ● Applications: ○ Spam filtering, Search engines, computer vision, and others.
  • 28. Machine Learning Knowledge Discovery in Databases (KDD) ● It is the process of finding knowledge in data. [http://www.rithme.eu/?m=home&p=kdprocess&lang=en]
  • 29. Machine Learning Exploring the data ● Frequent questions before starting the preprocessing stage ○ How many attributes? (Categorical / Numerical) ○ Are there missing values? ○ Do the numeric attributes need scaling? ○ Is there a label attribute? (Supervised / Unsupervised) ● Plot your data ○ A simple task that might show something you have not realized yet.
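The questions above map directly onto a few pandas calls. A minimal sketch on an invented toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, np.nan, 41],                  # numerical, with a missing value
    "city": ["Natal", "Recife", "Natal", None],   # categorical
    "label": ["yes", "no", "yes", "no"],          # label attribute -> supervised setting
})

print(df.dtypes)          # how many attributes, and of which type?
print(df.isnull().sum())  # are there missing values?
print(df.describe())      # scale of the numeric attributes
```

Three one-liners already answer most of the checklist before any modeling starts.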
  • 30. Machine Learning Exploring the data ● Types of plot ○ Scatter ○ Histogram ○ Boxplot ○ Bars ○ and others
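A minimal matplotlib sketch of two of the plot types above, on random data (the Agg backend is used so it runs without a display; the file name is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y, s=10)   # scatter: relation between two attributes
ax1.set_title("Scatter")
ax2.hist(x, bins=20)      # histogram: distribution of one attribute
ax2.set_title("Histogram")
fig.savefig("explore.png")
```

The same data, seen through two plot types, answers two different questions about it.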
  • 31. Machine Learning Preprocessing ● It is the first stage of data mining in which the data is prepared for mining. ● This stage presents the following tasks: ○ Data cleaning ■ Remove noise and inconsistent data. ○ Data integration ■ Multiple data sources may be combined. ○ Data selection ■ Select a set of data from a given dataset. ○ Data transformation ■ Transform the data into more appropriate forms to be processed.
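The four preprocessing tasks can be sketched with pandas on invented toy data (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "age": [25, 31, np.nan]})
orders = pd.DataFrame({"id": [1, 2, 3], "total": [100.0, 55.0, 9.0]})

# Data integration: combine two sources on a common key.
df = customers.merge(orders, on="id")

# Data cleaning: remove rows with missing values.
df = df.dropna()

# Data selection: keep only the relevant subset.
df = df[df["total"] > 10.0]

# Data transformation: rescale a numeric attribute to [0, 1] (min-max).
df["total"] = (df["total"] - df["total"].min()) / (df["total"].max() - df["total"].min())
print(df)
```

Each line is one of the four tasks on the slide; real pipelines repeat and interleave them.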
  • 32. Machine Learning Processing ● It is the second stage focused on data mining. ● This stage aims to extract data patterns by applying intelligent methods. ● Create a model by applying ML methods (Classification / Regression) ○ Linear regression ○ Naïve Bayes ○ SVM ○ Neural networks ○ and others.
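As a minimal example of the model-building step, here is ordinary least-squares linear regression (the first method listed above) with NumPy on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)  # true model: y = 3x + 1

# Closed-form least squares: add a bias column and solve for the weights.
A = np.hstack([X, np.ones((100, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w)  # approximately [3.0, 1.0] (slope, intercept)
```

The "intelligent method" here is just solving for the weights that minimize squared error; the other methods on the slide replace this step with more expressive models.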
  • 33. Machine Learning Analyze the results (Pattern evaluation) ● This stage identifies the truly interesting patterns representing knowledge based on some interesting measures. [11] ● Finding a data pattern is an iterative process. ○ Try different models, metrics and analyze the results.
  • 34. Machine Learning How to get data? ● Data markets ○ UCI Repository ○ Datasets.co ○ Dublinked ● Competition and challenges ○ Kaggle ○ Data driven ○ Innocentive
  • 35. Big Data Tools Question ● Do the classic ML methods fit Big Data problems? Answer: No! The classical ML methods do not fit Big Data requirements.
  • 36. Big Data Tools Lambda Architecture ● Process the data in batch and in real time. ● 3 layers ○ Batch layer ○ Speed layer ○ Serving layer [http://lambda-architecture.net/]
  • 37. Big Data Tools Hadoop ● It is an open source framework for writing and running distributed applications that process large amounts of data. [12] ○ Parallel processing ● HDFS: Hadoop File System ○ It is based on GFS (Google File System) ● Mapreduce ○ It splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. ○ Data structure (I/O) ■ Key-Value
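The key-value contract of MapReduce can be illustrated in plain Python, without Hadoop: map emits (word, 1) pairs, a shuffle groups them by key, and reduce aggregates each group. This is only the logical model; Hadoop runs the same three phases distributed across nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit one (key, value) pair per word."""
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)

lines = ["big data big tools", "data tools data"]

# Shuffle: group the mapped pairs by key, as the framework would.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts)  # {'big': 2, 'data': 3, 'tools': 2}
```

Because map and reduce only see key-value pairs, each phase can be split over independent chunks and run in parallel, which is exactly the property Hadoop exploits.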
  • 38. Big Data Tools Hadoop ● Features of Hadoop ○ Accessible: It runs on large clusters of commodity machines. ○ Robust: It handles most of failures. ○ Scalable: It scales linearity to handle large data by adding more node. ○ Simple: It allows you to quickly write efficient parallel code.
  • 39. Big Data Tools Hadoop: components ● Namenode ○ It applies the master/slave architecture. It is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. [12] ○ It keeps track of file locations (among nodes) and the health of the distributed system. ● DataNode ○ Each slave machine runs a DataNode daemon to perform the grunt work of the distributed filesystem, that is, reading and writing HDFS blocks to actual files on the local filesystem.
  • 40. Big Data Tools Hadoop: components ● Jobtracker ○ It is the liaison between your application and Hadoop - submit the code to cluster. ○ It determines the execution plan. ○ It holds an overall view of the system execution. ● Tasktracker ○ It manages the execution of individual tasks on each slave node. ○ It holds a local view (compared to Jobtracker).
  • 43. Big Data Tools Spark ● It is a fast and general engine for large-scale data processing. [13] ● In-memory processing ● A stack of libraries ○ Spark SQL ○ Spark Streaming ○ MLlib ○ GraphX [http://spark.apache.org/]
  • 44. Big Data Tools Spark: performance ● It runs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. [13] ○ Example: Logistic regression (see image below). ● Resilient Distributed Dataset (RDD) ○ It is an abstraction that enables developers to materialize any point in a processing pipeline into memory across the cluster, meaning that future steps that want to deal with the same data set need not recompute it or reload it from disk. ● It is well suited for highly iterative algorithms that require multiple passes over the data set.
  • 45. Big Data Tools NoSQL ● Not only SQL / Non-relational ○ It refers to a specific set of databases whose mechanism for storage and retrieval of data does not follow the tabular relations used in common relational databases. ● Why NoSQL? ○ Simplicity of design. ○ Finer control over availability. ○ Simpler "horizontal" scaling to clusters of machines
  • 46. Big Data Tools NoSQL ● NoSQL data structure ○ Key-Value ○ Wide column ○ Graph ○ Document ● Data structure of a NoSQL is more flexible. ● Compromise consistency in favor of: (Apply "eventual consistency") ○ Availability ○ Partition tolerance ○ Speed
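A toy sketch of the simplest of these data structures, the key-value store: an in-memory dict standing in for a NoSQL engine. This is illustrative only (the class name and API are invented); real stores add persistence, replication, and the eventual-consistency trade-offs above.

```python
import json

class TinyKV:
    """A minimal key-value store: opaque string keys, JSON document values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = json.dumps(value)  # store the value as a document blob

    def get(self, key, default=None):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else default

store = TinyKV()
store.put("user:42", {"name": "Ana", "follows": ["user:7"]})
print(store.get("user:42"))
```

Note the flexibility: no schema is declared anywhere, and each value can carry a different shape, which is exactly the "more flexible data structure" point above.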
  • 47. Big Data Tools NoSQL ● Drawbacks ○ Lack of standardized interfaces. ○ Low-level query languages. ○ Lost writes / Data loss
  • 48. Big Data Tools NoSQL: Databases ● Column ● Document ● Key-Value ● Graph ● Example: HyperDex
  • 50. Big Data Tools Big Data Ecosystem [http://dataconomy.com/understanding-big-data-ecosystem/]
  • 51. Big Data Tools Think about a Big Data Problem ● Task: how could Big Data tools fit your problem? ○ Time: 20 minutes ○ We will discuss each problem together. ■ Explain your problem in a few lines. ■ Explain why your proposal might be a good solution for it. ○ Material: ■ Post-its and pens
  • 52. Practice Programming Languages for Data Science ● Python ○ It is a programming language with the following characteristics: ■ High-level; ■ General-purpose; ■ Interpreted; ■ Dynamically typed; ■ Expresses concepts in fewer lines of code (e.g. compared to C++ and Java); ■ Significant indentation; ■ Cross-platform;
  • 53. Practice Programming Languages for Data Science ● R ○ It is a software environment for statistical computing and graphics. ■ Features: ● Interpreted language; ● Widely used by statisticians and data miners; ● Several graphical front-ends available; ● Cross-platform;
  • 54. Practice Python packages for Machine Learning ● NumPy ○ Scientific computing: ■ A powerful N-dimensional array object. ■ Useful linear algebra, Fourier transform, and random number capabilities. ● Scikit-Learn ○ Machine learning: ■ Simple and efficient tools for data mining and data analysis. ● Scipy ○ Scientific computing and technical computing: ■ It depends on NumPy. ■ It provides many user-friendly and efficient numerical routines (e.g. numerical integration)
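The NumPy features listed above fit in a few lines (the arrays are arbitrary examples):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # a powerful N-dimensional array object
b = np.ones((2, 3))

print(a + b)                      # elementwise arithmetic
print(a @ b.T)                    # linear algebra: matrix product
print(np.fft.fft([1, 0, 0, 0]))   # Fourier transform
print(np.random.default_rng(0).random(3))  # random number capabilities
```

SciPy and scikit-learn build on exactly these arrays, which is why NumPy sits underneath the whole stack.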
  • 55. Practice Python packages for Machine Learning ● Pandas ○ It provides high-performance, easy-to-use data structures and data analysis tools. ■ Series ■ Dataframe ■ Panel ■ Panel4D / PanelND (deprecated) ● Matplotlib ○ It is a 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. ● Other packages exist in the ML / Big Data world for data science!
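The two core pandas structures from the list above, in a few lines (toy values):

```python
import pandas as pd

# Series: a labeled one-dimensional array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# DataFrame: a table of labeled columns (each column is a Series).
df = pd.DataFrame({"price": [10.0, 12.5], "qty": [3, 1]})
df["total"] = df["price"] * df["qty"]  # vectorized column arithmetic

print(s.mean())  # 20.0
print(df)
```

Panel (3-D) and Panel4D/PanelND extended the same idea to more dimensions before being deprecated in favor of multi-indexed DataFrames.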
  • 56. Practice Anaconda ● It is a powerful collaboration and package management platform for open-source and private projects. ○ Features: ■ Python and R programming; ■ Large-scale data processing; ■ Predictive analytics; ■ Scientific computing; ■ Simplified package management and deployment; ■ Cloud service
  • 57. Practice Jupyter Notebook ● It is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. ○ Features: ■ Web-based; ■ Interactive data science and scientific computing; ■ Supports more than 40 programming languages; ■ Describes data analysis in a simple way; ■ Human-readable docs; ■ Big Data integration ● Spark
  • 58. Practice Practice 01 ● Introduction to Numpy ○ Array creation ○ Operations ○ Array transformations ○ Generate artificial data (Random sampling) ○ Statistical functions ● Find the introduction document on: ○ https://github.com/soutogustavo/data-science ■ Folder: Workshops / Cientec_2016_UFRN;
  • 59. Practice Practice 01 ● Tasks: ○ Open Anaconda and start Jupyter Notebook ○ Create 2 numpy arrays (1x2 and 1x4) and perform: ■ Concatenate arrays; ■ Flat the concatenated array; ■ Reshape the array to 2x3; ○ Create 2 numpy arrays with 1000 samples: ■ Apply statistical functions (e.g. mean, var, std.); ● Additional Information: ○ Time: 30 minutes ○ Material: ■ Anaconda, Python and Numpy;
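One possible solution sketch for the tasks above (the concrete values are arbitrary; any arrays of shapes (2,) and (4,) work):

```python
import numpy as np

# Two arrays, of 2 and 4 elements
a = np.array([1, 2])
b = np.array([3, 4, 5, 6])

c = np.concatenate([a, b])  # concatenate -> 6 elements
flat = c.ravel()            # flatten (already 1-D here)
m = c.reshape(2, 3)         # reshape to 2x3

# Two arrays with 1000 samples each, plus statistical functions
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.uniform(size=1000)
print(x.mean(), x.var(), x.std())
print(y.mean(), y.var(), y.std())
print(m)
```

Note that the reshape only works because 2 + 4 = 6 = 2 x 3 elements; NumPy never invents or drops data on reshape.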
  • 60. Practice Practice 02 ● Introduction to Pandas ○ Create Series ○ Create Dataframe ○ Generate artificial data ○ Transform Series to Dataframe ○ Load a Dataset ○ Drop column ○ Insert column ○ Statistical functions ● Find the introduction document on: ○ https://github.com/soutogustavo/data-science ■ Folder: Workshops / Cientec_2016_UFRN;
  • 61. Practice Practice 02 ● Tasks: ○ Create Series from random sampling: ■ Number of samples: 500; ■ Apply statistical functions; ■ Transform Data Series into DataFrame; ○ Create DataFrame from random sampling (5 attributes): ■ Number of samples: 100; ■ Drop one column; ■ Create a label attribute and insert it to the dataframe; ○ Load data ■ source: http://www.jwall.org/streams/sample-stream.csv ■ Apply statistical functions for each attribute; ■ Find out the number of (possible) labels and count them; ● Additional Information: ○ Time: 40 minutes ○ Material: ■ Anaconda, Python, Numpy and Pandas;
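A solution sketch for the random-sampling parts of the tasks above (the CSV task is left to the workshop itself, since the sample-stream URL may no longer resolve; column and label names here are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Series from random sampling: 500 samples + statistical functions
s = pd.Series(rng.normal(size=500))
print(s.mean(), s.var(), s.std())
df_from_series = s.to_frame(name="x")  # transform the Series into a DataFrame

# DataFrame from random sampling: 100 samples, 5 attributes
df = pd.DataFrame(rng.random((100, 5)), columns=list("abcde"))
df = df.drop(columns=["e"])                           # drop one column
df["label"] = np.where(df["a"] > 0.5, "pos", "neg")   # insert a label attribute
print(df["label"].value_counts())                     # count the (possible) labels
```

For the CSV task, `pd.read_csv(url)` plus `df.describe()` and `df["label"].value_counts()` follow the same pattern.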
  • 62. Conclusions Next steps ● Books ○ Marz, N. and Warren, J.. Big Data: principles and best practices of scalable real-time data systems. Manning, 1st ed., 2015. ○ Lam, C. Hadoop in Action. Manning, 1st ed., 2015. ○ Karau, H. et. al. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly, 2015. ○ Lutz, M. Learning Python. O’Reilly, 5th ed., 2013.
  • 64. References 1. IBM-01. What is big data? (2016). Retrieved from: https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html. 2. Beal, V.. Structured Data (2016). Retrieved from: http://www.webopedia.com/TERM/S/structured_data.html. 3. Driscoll, Michael E. How much data is "Big Data"? (2010). Retrieved from: https://www.quora.com/How-much-data-is-Big-Data. 4. Zikopoulos et al.. Harness the Power of Big Data: The IBM Big Data Platform. 2013. McGraw Hill. ISBN: 978-0-07-180817-0. 5. Wikipedia: Free Encyclopedia. Internet of Things (2016). Retrieved from: https://en.wikipedia.org/wiki/Internet_of_things.
  • 65. References 6. Wikipedia: Free Encyclopedia. Business Intelligence (2016). Retrieved from: https://en.wikipedia.org/wiki/Business_intelligence. 7. Tutorials Points. Data Warehousing - Concepts (2016). Retrieved from: http://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm. 8. Big Data University. Data Scientist vs Data Engineer, What’s the difference? (2016) Retrieved from: https://bigdatauniversity.com/blog/data-scientist-vs-data-engineer/. 9. Simon, P.. Too Big to Ignore: The Business Case for Big Data. Wiley. p. 89. March 18, 2013. ISBN 978-1-118-63817-0. 10. Mitchell, T.. Machine Learning. McGraw Hill, 1997. ISBN: 0070428077.
  • 66. References 11. Han, J. and Kamber, M.. Data Mining: Concepts and Techniques. MK, 2nd ed., 2006. ISBN-10: 1-55860-901-6. 12. Lam, C. Hadoop in Action. Manning, 2011. ISBN: 9781935182191. 13. Apache Spark. Spark (2016). Retrieved from: http://spark.apache.org/