The document provides an overview of big data and social analytics, covering topics such as the definition of big data, machine learning, common big data tools like Hadoop and Spark, programming languages for data science like Python and R, and packages for machine learning in Python. It also discusses practical applications of big data and introduces exercises for hands-on practice with tools like NumPy in Jupyter notebooks.
Big Data & Social Analytics presentation
1. Big Data & Social
Analytics
Gustavo Souto, M.Sc.
2. Summary
● Part One: Theory about Big Data and Analytics
○ About Data
○ What’s Big Data?
○ Machine Learning
○ Big Data tools
● Part Two: Practice
○ Languages for Data Science
○ Main python packages for Machine Learning
○ Let's get some practice
● Part Three: Conclusions
○ Let’s recap!
○ Next steps
○ References
3. About me
Gustavo Souto
I am a Ph.D. student at the Federal University of Rio Grande
do Norte (UFRN). I started my Ph.D. at Technische
Universität Dortmund (TU Dortmund) in Germany.
I also hold a Master's degree in Computer Engineering
from UFRN.
Topics of interest: machine learning, Big Data, data
streams, anomaly detection
4. About Data
Understanding the data
● How much data do we create every day?
○ About 2.5 quintillion bytes [1].
● How about in 1992?
○ About 100 GB per day.
● Let’s check out the table of data size to better understand the data.
5. About Data
Name Size Example
Byte 8 bits A single character
Kilobyte 1,000 bytes A compressed document or image page (50 KB)
Megabyte 1,000 kilobytes A digital book (5 MB)
Gigabyte 1,000 megabytes A symphony recorded in hi-fi
Terabyte 1,000 gigabytes An automated tape robot
Petabyte 1,000 terabytes All academic research libraries in the EU (2 PB)
Exabyte 1,000 petabytes All words ever spoken by humanity (5 EB)
Zettabyte 1,000 exabytes
Yottabyte 1,000 zettabytes The current storage capacity of the Internet
6. About Data
Where do we find data?
● Text:
○ Documents and reports.
● Databases:
○ MySQL, PostgreSQL, Oracle etc.
● Geographic:
○ GPS and Maps.
● Social Media:
○ Facebook, Twitter, Instagram etc.
● Archives:
○ JSON, CSV, XML etc.
● APIs:
● Images and videos
7. About Data
About internal structure of data
● Structured data
○ The data follows a well-defined structure, that is, it is organized in titled
columns and rows.
○ Examples: Datasets from MySQL, PostgreSQL, Oracle etc.
● Semistructured data
○ A form of structured data that lacks a strict data model [2].
○ Examples: Emails, XML, JSON.
● Unstructured data
○ The data does not follow a structure.
○ Examples: Free-form text, comment fields, tweets.
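The distinction can be seen with Python's standard library alone. A minimal sketch (the sample records are made up for illustration): CSV rows share one fixed schema, while JSON records are self-describing and may omit fields.

```python
import csv
import io
import json

# Structured: every row follows the same schema of titled columns.
table = io.StringIO("id,name,age\n1,Ana,34\n2,Bruno,28\n")
rows = list(csv.DictReader(table))
print(rows[0]["name"])  # Ana

# Semistructured: self-describing, but fields may vary per record.
records = [json.loads(s) for s in (
    '{"user": "Ana", "tags": ["ml"]}',
    '{"user": "Bruno"}',  # no "tags" field: the schema is loose
)]
print(records[1].get("tags", []))  # []
```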
8. What’s Big Data
The 4 V’s
● Volume
○ A massive volume of both structured and unstructured data.
● How much data is considered ‘Big Data’?
○ Terabytes, Petabytes, and so on.
○ Driscoll created a simple table that defines borders [3].
9. What’s Big Data
The 4 V’s
● Variety
○ Try to capture all of the data that pertains to our decision-making process [4].
● Data complements
○ Analyze different sources to drive decisions.
○ Most of the data out there is semistructured or unstructured.
■ Example: Customer call center (voice classification + customer’s record data
+ transaction history).
“This is the third outage I’ve had in one week!”
10. What’s Big Data
The 4 V’s
● Velocity
○ The rate at which data arrives at the enterprise and is processed or well
understood [4].
● “How long does it take you to do something about it or know it has even
arrived?” [4]
○ One pass processing.
■ Example: An enterprise facing a network security problem.
○ Data stream
■ Example: Netflix, Youtube, Sensors.
○ Velocity is one of the most overlooked areas of Big Data.
11. What’s Big Data
The 4 V’s
● Veracity
○ It refers to the quality of data, or trustworthiness [4].
● Data transformations
○ Remove noise.
● Big spam
○ Untrustworthy information (noise).
○ Example: Some Tweets.
17. What’s Big Data
How about ethics?
● You are responsible for the data!
○ Be careful when you deal with them.
● Risks
○ (Think first!) Do the results bring risk to anyone?
■ Understand the risks of your decisions.
● Personally Identifiable Information: PII
○ Information that identifies one person from another.
○ The data must be anonymized.
○ Red flags: address, telephone number, geolocation, codes, slang.
18. What’s Big Data
Everything is connected
● Internet of Things (IoT)
○ It is an internetworking of physical devices,
vehicles, buildings, and other items embedded
with software, sensors, actuators, and network
connectivity that enables these objects to
collect and exchange data [5].
19. What’s Big Data
Business Intelligence (BI)
● Definition
○ It is a set of techniques and tools for the acquisition and transformation of raw data
into meaningful and useful information for business analysis purposes [6].
● Common functionalities
○ Reporting
○ Online analytical processing
○ Analytics
○ Data Mining
○ Process Mining
○ Complex Event Processing (CEP)
○ etc.
20. What’s Big Data
Business Intelligence (BI)
● Support a wide range of business decisions.
● BI Framework
○ Data Warehousing (DW)
■ It is constructed by integrating data from multiple heterogeneous sources that support
analytical reporting, structured and/or ad hoc queries, and decision making [7].
■ It is considered a core component of BI.
○ ETL’s (Extract - Transform - Load)
○ Analysis tools
21. What’s Big Data
Business Intelligence (BI)
● Online Analytical Processing (OLAP)
○ An approach which answers multi-dimensional analytical queries swiftly.
○ Encompasses: relational databases, report writing, and data mining.
○ Apps example: business reporting for sales, marketing, management reporting.
● Online Transactional Processing (OLTP)
○ A class of information systems that facilitate and manage transaction-oriented
applications, that is, it processes transactions rather than BI or reporting.
○ Queries are much less complex than in OLAP, but arrive in large volumes.
23. What’s Big Data
Business Intelligence (BI)
● Drawbacks
○ Tries to build perfect statistical models even after the data have already changed.
○ Describes the past rather than predicting the future.
○ Assumes that the state of the data is constant.
○ Poor support for video, audio, logs, and other unstructured data.
24. What’s Big Data
Data Science Roles
● Data Scientist
○ They are experienced data professionals in their organization who can query and process
data, provide reports, summarize and visualize data [8].
● Data Engineer
○ They are the data professionals who prepare the “big data” infrastructure to be analyzed by
Data Scientists, that is, design, build, integrate data from various resources, and manage big
data [8].
● Business Intelligence Developers
○ They are data experts that interact more closely with internal stakeholders to understand the
reporting needs, and then to collect requirements, design, and build BI and reporting solutions
for the company [8].
25. What’s Big Data
Think about Big Data Problem
● Task: come up with a problem (Big Data)
○ Time: 20 minutes
○ We will discuss each problem together.
■ Explain your problem in a few lines.
■ Explain why you think Big Data might be a good solution for it.
○ Material:
■ Post-its and pens
26. Machine Learning
What’s Machine Learning?
● It gives computers the ability to learn without being explicitly programmed. [9]
● It is the capacity of a computer program to learn from experience E with
respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.
[10]
27. Machine Learning
Data Model
● It organizes the data elements and standardizes how these elements relate to
one another.
○ An algorithm builds a model from sample inputs.
● Applications:
○ Spam filtering, Search engines, computer vision, and others.
28. Machine Learning
Knowledge Discovery in Databases (KDD)
● It is the process of finding knowledge in data.
[http://www.rithme.eu/?m=home&p=kdprocess&lang=en]
29. Machine Learning
Exploring the data
● Frequent questions before starting the preprocessing stage
○ How many attributes? (Categorical / Numerical)
○ Are there missing values?
○ Are there scalar attributes (for numeric ones)?
○ Is there a label attribute? (Supervised / Unsupervised)
● Plot your data
○ A simple task that might show something you have not realized yet.
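The exploration questions above can be answered in a few lines of Pandas. A minimal sketch over a hypothetical dataset (the column names and values are invented; a missing value is planted deliberately):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two numerical attributes, one categorical label,
# with one missing value.
df = pd.DataFrame({
    "height": [1.70, 1.62, np.nan, 1.81],
    "weight": [68.0, 55.5, 90.2, 77.3],
    "label":  ["a", "b", "b", "a"],
})

print(df.dtypes)             # how many attributes, categorical vs. numerical?
print(df.isnull().sum())     # are there missing values?
print(df["label"].nunique()) # is there a label attribute, and how many classes?
# df.plot(x="weight", y="height", kind="scatter")  # plot your data (needs matplotlib)
```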
31. Machine Learning
Preprocessing
● It is the first stage of data mining in which the data is prepared for mining.
● This stage presents the following tasks:
○ Data cleaning
■ Remove noise and inconsistent data.
○ Data integration
■ Multiple data sources may be combined.
○ Data selection
■ Select a set of data from a given dataset.
○ Data transformation
■ Transform the data into more appropriate forms to be processed.
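The four preprocessing tasks can be sketched with Pandas on made-up data (the `customers`/`purchases` tables and the planted noise are illustrative, not from the deck):

```python
import numpy as np
import pandas as pd

# Hypothetical raw sources: one has noise (an impossible age) and a missing value.
customers = pd.DataFrame({"id": [1, 2, 3, 4], "age": [34, -5, np.nan, 51]})
purchases = pd.DataFrame({"id": [1, 2, 3, 4], "total": [120.0, 80.5, 43.0, 99.9]})

# Data cleaning: remove inconsistent and missing records.
clean = customers[customers["age"] > 0].dropna()

# Data integration: combine the two sources on a shared key.
merged = clean.merge(purchases, on="id")

# Data selection: keep only the attributes relevant to the analysis.
selected = merged[["age", "total"]]

# Data transformation: rescale to a more appropriate form (min-max scaling).
normalized = (selected - selected.min()) / (selected.max() - selected.min())
```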
32. Machine Learning
Processing
● It is the second stage focused on data mining.
● This stage aims to extract data patterns by applying intelligent methods.
● Create a model by applying ML methods (Classification / Regression)
○ Linear regression
○ Naïve Bayes
○ SVM
○ Neural networks
○ and others.
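scikit-learn (introduced later in the deck) covers all of the listed methods. A minimal sketch fitting one of them, Naïve Bayes, on the classic Iris dataset with a train/test split:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load a small labeled dataset and hold out a test set for honest evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit a Gaussian Naive Bayes classifier and score its predictions.
model = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

Swapping `GaussianNB` for `sklearn.svm.SVC` or `sklearn.linear_model.LinearRegression` (on a regression task) follows the same fit/predict pattern.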
33. Machine Learning
Analyze the results (Pattern evaluation)
● This stage identifies the truly interesting patterns representing knowledge,
based on interestingness measures. [11]
● Finding a data pattern is an iterative process.
○ Try different models, metrics and analyze the results.
34. Machine Learning
How to get data?
● Data markets
○ UCI Repository
○ Datasets.co
○ Dublinked
● Competition and challenges
○ Kaggle
○ Data driven
○ Innocentive
35. Big Data Tools
Question
● Do the classic ML methods fit to Big Data problems?
Answer: No! Classic ML methods do not fit Big Data requirements.
36. Big Data Tools
Lambda Architecture
● Process the data in Batch
and Real-Time.
● 3 Layers
○ Batch layer
○ Speed layer
○ Serving layer
[http://lambda-architecture.net/]
37. Big Data Tools
Hadoop
● It is an open source framework for writing and running distributed applications
that process large amounts of data. [12]
○ Parallel processing
● HDFS: Hadoop Distributed File System
○ It is based on GFS (Google File System)
● MapReduce
○ It splits the input data-set into independent chunks which are processed by the
map tasks in a completely parallel manner.
○ Data structure (I/O)
■ Key-Value
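The key-value I/O structure can be illustrated with a toy word count, the "hello world" of MapReduce, in plain Python (the `chunks` data is invented; real Hadoop distributes these steps across nodes):

```python
from collections import defaultdict

# Input split into independent chunks, as HDFS would do.
chunks = ["big data big tools", "data stream data"]

# Map: each chunk is processed independently, emitting (key, value) pairs.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle: group the emitted pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'tools': 1, 'stream': 1}
```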
38. Big Data Tools
Hadoop
● Features of Hadoop
○ Accessible: It runs on large clusters of commodity machines.
○ Robust: It handles most failures gracefully.
○ Scalable: It scales linearly to handle more data by adding more nodes.
○ Simple: It allows you to quickly write efficient parallel code.
39. Big Data Tools
Hadoop: components
● Namenode
○ It applies the master/slave architecture. It is the master of HDFS that directs the
slave DataNode daemons to perform the low-level I/O tasks. [12]
○ It keeps track of file locations (across nodes) and the health of the distributed system.
● DataNode
○ Each slave machine runs a DataNode daemon to perform the grunt work of the
distributed filesystem, that is, reading and writing HDFS blocks to actual files on
the local filesystem.
40. Big Data Tools
Hadoop: components
● Jobtracker
○ It is the liaison between your application and Hadoop - submit the code to cluster.
○ It determines the execution plan.
○ It holds an overall view of the system execution.
● Tasktracker
○ It manages the execution of individual tasks on each slave node.
○ It holds a local view (compared to Jobtracker).
43. Big Data Tools
Spark
● It is a fast and general engine for large-scale data processing. [13]
● In-memory processing
● A stack of libraries
○ Spark SQL
○ Spark Streaming
○ MLlib
○ GraphX
[http://spark.apache.org/]
44. Big Data Tools
Spark: performance
● It runs up to 100x faster than Hadoop MapReduce in memory, or 10x faster
on disk. [13]
○ Example: logistic regression benchmarks.
● Resilient Distributed Dataset (RDD)
○ It is an abstraction that enables developers to materialize any point in a processing
pipeline into memory across the cluster, meaning that future steps that want to
deal with the same data set need not recompute it or reload it from disk.
● It is well suited for highly iterative algorithms that require multiple passes over the data
set.
45. Big Data Tools
NoSQL
● Not only SQL / Non relational
○ It refers to a set of databases whose mechanisms for storage and retrieval of
data do not follow the tabular relations used in common relational databases.
● Why NoSQL?
○ Simplicity of design.
○ Finer control over availability.
○ Simpler "horizontal" scaling to clusters of machines
46. Big Data Tools
NoSQL
● NoSQL data structure
○ Key-Value
○ Wide column
○ Graph
○ Document
● The data structures of NoSQL databases are more flexible.
● They compromise consistency ("eventual consistency") in favor of:
○ Availability
○ Partition tolerance
○ Speed
47. Big Data Tools
NoSQL
● Drawbacks
○ Lack of standardized interfaces.
○ Low-level query languages.
○ Lost writes / Data loss
50. Big Data Tools
Big Data Ecosystem
[http://dataconomy.com/understanding-big-data-ecosystem/]
51. Big Data Tools
Think about Big Data Problem
● Task: how could big data tools fit to your problem?
○ Time: 20 minutes
○ We will discuss each problem together.
■ Explain your problem in a few lines.
■ Explain why your proposal might be a good solution for it.
○ Material:
■ Post-its and pens
52. Practice
Programming Languages for Data science
● Python
○ It is a programming language with the following characteristics:
■ High-level;
■ General-purpose;
■ Interpreted language;
■ Dynamic programming;
■ Express concepts in fewer lines of code (i.e. compared to C++ and Java);
■ Indentation;
■ Cross-platform;
53. Practice
Programming Languages for Data science
● R
○ It is a software environment for statistical computing and graphics.
■ Features:
● Interpreted language;
● Widely used by statisticians and data miners;
● Several graphical-frontends available;
● Cross-platform;
54. Practice
Python packages for Machine Learning
● NumPy
○ Scientific computing:
■ A powerful N-dimensional array object.
■ Useful linear algebra, Fourier transform, and random number capabilities.
● Scikit-Learn
○ Machine learning:
■ Simple and efficient tools for data mining and data analysis.
● Scipy
○ Scientific computing and technical computing:
■ It depends on NumPy.
■ It provides many user-friendly and efficient numerical routines (e.g. numerical
integration)
55. Practice
Python packages for Machine Learning
● Pandas
○ It provides high-performance, easy-to-use data structures and data analysis tools.
■ Series
■ Dataframe
■ Panel
■ Panel4D / PanelND (Deprecated)
● Matplotlib
○ It is a 2D plotting library which produces publication quality figures in a variety of
hardcopy formats and interactive environments across platforms.
● Many other packages exist in the ML / Big Data world for data science!
56. Practice
Anaconda
● It is a platform for collaboration and package management on open-source and
private projects.
○ Features:
■ Python and R programming;
■ Large-scale data processing;
■ Predictive analytics;
■ Scientific computing;
■ Simplify package management and deployment;
■ Cloud service
57. Practice
Jupyter Notebook
● It is a web application that allows you to create and share documents that
contain live code, equations, visualizations and explanatory text.
○ Features:
■ Web-based;
■ Interactive data science and scientific computing;
■ Supports more than 40 programming languages;
■ Describe data analysis in a simple way;
■ Human-readable docs;
■ Big data integration
● Spark
58. Practice
Practice 01
● Introduction to Numpy
○ Array creation
○ Operations
○ Array transformations
○ Generate artificial data (Random sampling)
○ Statistical functions
● Find the introduction document on:
○ https://github.com/soutogustavo/data-science
■ Folder: Workshops / Cientec_2016_UFRN;
59. Practice
Practice 01
● Tasks:
○ Open Anaconda and start Jupyter Notebook
○ Create 2 numpy arrays (1x2 and 1x4) and perform:
■ Concatenate arrays;
■ Flatten the concatenated array;
■ Reshape the array to 2x3;
○ Create 2 numpy arrays with 1000 samples:
■ Apply statistical functions (e.g. mean, var, std.);
● Additional Information:
○ Time: 30 minutes
○ Material:
■ Anaconda, Python and Numpy;
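One possible solution sketch for these tasks (the array values are arbitrary; any 1x2 and 1x4 arrays work):

```python
import numpy as np

# Create a 1x2 and a 1x4 array, concatenate, flatten, and reshape to 2x3.
a = np.array([[1, 2]])
b = np.array([[3, 4, 5, 6]])
joined = np.concatenate([a, b], axis=1)  # shape (1, 6)
flat = joined.flatten()                  # shape (6,)
grid = flat.reshape(2, 3)                # shape (2, 3)

# Two arrays of 1000 random samples, with basic statistical functions.
x = np.random.randn(1000)
y = np.random.rand(1000)
print(x.mean(), x.var(), x.std())
print(y.mean(), y.var(), y.std())
```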
60. Practice
Practice 02
● Introduction to Pandas
○ Create Series
○ Create Dataframe
○ Generate artificial data
○ Transform Series to Dataframe
○ Load a Dataset
○ Drop column
○ Insert column
○ Statistical functions
● Find the introduction document on:
○ https://github.com/soutogustavo/data-science
■ Folder: Workshops / Cientec_2016_UFRN;
61. Practice
Practice 02
● Tasks:
○ Create Series from random sampling:
■ Number of samples: 500;
■ Apply statistical functions;
■ Transform Data Series into DataFrame;
○ Create DataFrame from random sampling (5 attributes):
■ Number of samples: 100;
■ Drop one column;
■ Create a label attribute and insert it to the dataframe;
○ Load data
■ source: http://www.jwall.org/streams/sample-stream.csv
■ Apply statistical functions for each attribute;
■ Find out the number of (possible) labels and count them;
● Additional Information:
○ Time: 40 minutes
○ Material:
■ Anaconda, Python, NumPy and Pandas;
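One possible solution sketch for the random-sampling parts of these tasks (random data is used throughout so the sketch is self-contained; loading the remote CSV works the same way via `pd.read_csv(url)`):

```python
import numpy as np
import pandas as pd

# A Series of 500 random samples: statistics, then convert to a DataFrame.
s = pd.Series(np.random.randn(500))
print(s.mean(), s.std())
df_from_series = s.to_frame(name="x")

# A DataFrame of 100 samples over 5 attributes.
df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))
df = df.drop(columns="e")                                  # drop one column
df["label"] = np.random.choice(["pos", "neg"], size=100)   # insert a label attribute
print(df.describe())                                       # statistics per attribute
print(df["label"].value_counts())                          # count the (possible) labels
```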
62. Conclusions
Next steps
● Books
○ Marz, N. and Warren, J.. Big Data: Principles and Best Practices of Scalable
Real-Time Data Systems. Manning, 1st ed., 2015.
○ Lam, C.. Hadoop in Action. Manning, 1st ed., 2011.
○ Karau, H. et. al. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly, 2015.
○ Lutz, M. Learning Python. O’Reilly, 5th ed., 2013.
64. References
1. IBM-01. What is big data? (2016). Retrieved from:
https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html.
2. Beal, V.. Structured Data (2016). Retrieved from:
http://www.webopedia.com/TERM/S/structured_data.html.
3. Driscoll, Michael E. How much data is "Big Data"? (2010). Retrieved from:
https://www.quora.com/How-much-data-is-Big-Data.
4. Zikopoulos, P. et al.. Harness the Power of Big Data: The IBM Big Data
Platform. McGraw Hill, 2013. ISBN: 978-0-07-180817-0.
5. Wikipedia: Free Encyclopedia. Internet of Things (2016). Retrieved from:
https://en.wikipedia.org/wiki/Internet_of_things.
65. References
6. Wikipedia: Free Encyclopedia. Business Intelligence (2016). Retrieved from:
https://en.wikipedia.org/wiki/Business_intelligence.
7. Tutorials Points. Data Warehousing - Concepts (2016). Retrieved from:
http://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm.
8. Big Data University. Data Scientist vs Data Engineer, What’s the
difference? (2016) Retrieved from:
https://bigdatauniversity.com/blog/data-scientist-vs-data-engineer/.
9. Simon, P.. Too Big to Ignore: The Business Case for Big Data. Wiley. p.
89. March 18, 2013. ISBN 978-1-118-63817-0.
10. Mitchell, T.. Machine Learning. McGraw Hill, 1997. ISBN: 0070428077.
66. References
11. Han, J. and Kamber, M.. Data Mining: Concepts and Techniques. MK, 2nd
ed., 2006. ISBN-10: 1-55860-901-6.
12. Lam, C. Hadoop in Action. Manning, 2011. ISBN: 9781935182191.
13. Apache Spark. Spark (2016). Retrieved from: http://spark.apache.org/