The document provides an overview of a 3-day data analytics training program held in Jakarta, Indonesia from April 24-26, 2019. It discusses topics that will be covered including big data overview, data for business analysis, data analytics concepts, and data analytics tools. The training is led by Dr. Ir. John Sihotang and is aimed at management trainees of the company Sucofindo.
6. THEME OF THIS COURSE
Large-Scale Data Management
Big Data Analytics
Data Science and Analytics
• How to manage very large amounts of data and extract value and
knowledge from them
8. BIG DATA DEFINITION
• No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
9. CHARACTERISTICS OF BIG DATA:
1-SCALE (VOLUME)
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
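The 44x figure can be checked with a few lines of arithmetic. The sketch below uses only the two endpoints quoted on the slide (0.8 ZB in 2009, 35 ZB in 2020). The constant annual growth rate is a simplifying assumption made here for illustration, not something the slides state.

```python
# Endpoints quoted on the slide; the constant-rate model is an assumption.
start_zb, end_zb = 0.8, 35.0
start_year, end_year = 2009, 2020

growth_factor = end_zb / start_zb               # total growth over the period
years = end_year - start_year
annual_rate = growth_factor ** (1 / years) - 1  # implied compound annual growth

print(f"Total growth: {growth_factor:.1f}x over {years} years")
print(f"Implied annual growth: {annual_rate:.0%}")
```

Under that assumption, the total growth factor rounds to the 44x on the slide, and the volume would have to grow by roughly two fifths every year to get there.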
10. CHARACTERISTICS OF BIG DATA:
2-COMPLEXITY (VARIETY)
• Various formats, types, and structures
• Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting
many types of data
To extract knowledge → all these types of
data need to be linked together
11. CHARACTERISTICS OF BIG DATA:
3-SPEED (VELOCITY)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missed opportunities
• Examples
– E-Promotions: Based on your current location, your purchase
history, what you like → send promotions right now for the store next to
you
– Healthcare monitoring: sensors monitoring your activities and
body → any abnormal measurements require immediate reaction
14. HARNESSING BIG DATA
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
15. WHO’S GENERATING BIG DATA
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data
in a timely manner and in a scalable fashion
16. THE MODEL HAS CHANGED…
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
17. WHAT’S DRIVING BIG DATA
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
18. VALUE OF BIG DATA ANALYTICS
• Big data is more real-time in nature than
traditional DW applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not well-suited
for big data apps
• Shared nothing, massively parallel
processing, scale out architectures are
well-suited for big data apps
19. CHALLENGES IN HANDLING BIG DATA
• The Bottleneck is in technology
– New architecture, algorithms, techniques are needed
• Also in technical skills
– Experts in using the new technology and dealing with big data
21. BASIC DEFINITION
• Data: Data is a set of values of qualitative or
quantitative variables. It is information in raw
or unorganized form. It may be a fact, figure,
character, symbol, etc.
• Information: Meaningful or organized data is
information.
22. TYPES OF DATA (JENIS DATA)
• Structured data: data that has already been managed,
processed, and manipulated in an RDBMS
(Relational Database Management System). For example,
table data produced by a registration form on a
web service.
• Unstructured data: raw data that has just been obtained
from various kinds of activity and has not yet been
fitted to a database format. For example, a video file
captured by a camera.
• Semi-structured data: data that has some structure,
for example tags, but is not yet fully structured
within a database system. For example, data with
uniform tags whose contents differ depending on the
characteristics of the person filling it in.
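The three categories can be sketched in a few lines of Python. All names and values below are hypothetical and chosen only to illustrate the distinction: fixed columns for structured data, shared tags with varying fields for semi-structured data, and raw bytes for unstructured data.

```python
import json

# Structured: every record has the same fixed columns, as in an RDBMS table.
columns = ("id", "name", "email")
rows = [(1, "Ani", "ani@example.com"), (2, "Budi", "budi@example.com")]

# Semi-structured: records share tags (keys), but the fields present can differ.
raw = '[{"name": "Ani", "phone": "0812"}, {"name": "Budi", "hobbies": ["chess"]}]'
records = json.loads(raw)
all_tags = sorted({key for record in records for key in record})

# Unstructured: raw bytes with no schema, e.g. the first bytes of a video file.
video_bytes = bytes([0x00, 0x00, 0x00, 0x18])

print(all_tags)
```

The semi-structured records can still be queried by tag, but nothing guarantees that a given tag is present in every record.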
23. TYPE OF DATA
Data types are an important concept because statistical methods can only be
used with certain data types. You have to analyze continuous data differently
from categorical data; otherwise the analysis will be wrong. Knowing the
types of data you are dealing with therefore enables you to choose the
correct method of analysis.
25. Categorical (Nominal) data
• Nominal or categorical data is data that comprises categories that
cannot be rank ordered: each category is just different.
• The categories available cannot be placed in any order and no
judgement can be made about the relative size or distance from one
category to another.
– Categories bear no quantitative relationship to one another
– Examples:
  – customer’s location (America, Europe, Asia)
  – employee classification (manager, supervisor, associate)
• What does this mean? No mathematical operations can be performed
on the categories relative to each other.
• Therefore, nominal data reflect qualitative differences rather than
quantitative ones.
26. Nominal data: examples
Nominal values represent discrete units and are used to label variables
that have no quantitative value. Just think of them as “labels”. Note that
nominal data has no order, so changing the order of its values does not
change the meaning. Two examples of nominal features:
What is your gender? (please tick)
Male
Female
Did you enjoy the film? (please tick)
Yes
No
27. Nominal data
• Systems for measuring nominal data must
ensure that each category is mutually
exclusive and that the system of measurement
is exhaustive.
• Variables that have only two responses,
e.g. Yes or No, are known as
dichotomies.
28. Nominal data
When you are dealing with nominal data, you can summarize it through:
1. Frequencies: The frequency is how often something occurs over a
period of time or within a dataset.
2. Proportion: You can calculate the proportion by dividing the frequency
by the total number of events (e.g. how often something happened divided by
how often it could happen).
3. Percentage: the proportion multiplied by 100.
4. Visualization methods: To visualize nominal data you can use a pie chart or a
bar chart.
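The frequency, proportion, and percentage summaries listed above can be computed with the standard library alone. This is a minimal sketch; the survey answers are hypothetical.

```python
from collections import Counter

# Hypothetical answers to "Did you enjoy the film?"
answers = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"]

frequencies = Counter(answers)            # count per category
total = sum(frequencies.values())
proportions = {k: v / total for k, v in frequencies.items()}
percentages = {k: 100 * p for k, p in proportions.items()}

for category, count in frequencies.most_common():
    print(f"{category}: n={count}, "
          f"proportion={proportions[category]:.3f}, "
          f"{percentages[category]:.1f}%")
```

The same counts are what a bar chart or pie chart of this variable would display.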
29. Ordinal data
• Ordinal data is data that comprises categories that can
be rank ordered.
• As with nominal data, the distance between each
category cannot be calculated, but the categories can be
ranked above or below each other.
– No fixed units of measurement
– Examples:
  – college football rankings
  – survey responses (poor, average, good, very
  good, excellent)
• What does this mean? You can make statistical judgements and
perform limited maths.
30. Ordinal data: example
How satisfied are you with the level of service you have
received? (please tick)
Very satisfied
Somewhat satisfied
Neutral
Somewhat dissatisfied
Very dissatisfied
31. Ordinal data
• When you are dealing with ordinal data, you can
use the same methods as with nominal data, but
you also have access to some additional tools.
• You can therefore summarize your ordinal data
with frequencies, proportions, and percentages,
and you can visualize it with pie and bar charts.
• Additionally, you can use percentiles, median,
mode and the interquartile range to summarize
your data.
• In data science, you can use label encoding
to transform ordinal data into a numeric feature.
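Label encoding and the rank-based summaries mentioned above can be sketched as follows. The category order comes from the satisfaction example on the previous slide; the responses themselves are hypothetical.

```python
import statistics

# Ordered categories from the satisfaction survey, lowest to highest.
levels = ["Very dissatisfied", "Somewhat dissatisfied", "Neutral",
          "Somewhat satisfied", "Very satisfied"]
rank = {label: i for i, label in enumerate(levels)}  # label encoding

# Hypothetical responses.
responses = ["Neutral", "Very satisfied", "Somewhat satisfied",
             "Somewhat dissatisfied", "Somewhat satisfied"]
encoded = [rank[r] for r in responses]

# median_low always returns an actual data point, so it maps back to a label.
median_label = levels[statistics.median_low(encoded)]
mode_label = statistics.mode(responses)

print(encoded, median_label, mode_label)
```

The encoded integers preserve order only. Differences between them (for example, 4 minus 2) have no fixed meaning, which is why a mean is usually avoided for ordinal data.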
32. Interval and ratio data
• Both interval and ratio data are examples of scale data.
• Scale data:
– data is in numeric format ($50, $100, $150)
– data can be measured on a continuous scale
– the distance between each value can be observed and, as a
result, measured
– the data can be placed in rank order.
33. Interval data
• Ordinal data but with constant differences
between observations
• Ratios are not meaningful
• Examples:
– Time: moves along a continuous measure
of seconds, minutes and so on and is
without a zero point of time.
– Temperature (in Celsius or Fahrenheit): moves along a continuous
measure of degrees and is without a true
zero.
– SAT scores
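A short numeric illustration of why ratios are not meaningful on an interval scale. The two temperatures are hypothetical; the Celsius-to-Kelvin offset of 273.15 is the standard conversion.

```python
def to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return celsius + 273.15

yesterday_c, today_c = 10.0, 20.0

difference = today_c - yesterday_c    # meaningful on an interval scale
naive_ratio = today_c / yesterday_c   # computable, but "twice as hot" is wrong
true_ratio = to_kelvin(today_c) / to_kelvin(yesterday_c)

print(difference, naive_ratio, round(true_ratio, 3))
```

The difference of 10 degrees is valid on either scale, while the ratio changes completely once the scale has a true zero.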
34. Ratio data
• Ratio data is measured on a continuous
scale and has a natural zero point.
• Ratios are meaningful
• Examples:
– monthly sales
– delivery times
– weight
– height
– age
35. Continuous Data
• When you are dealing with continuous data, you have the widest choice of
methods to describe your data. You can summarize your data using
percentiles, median, interquartile range, mean, mode, standard
deviation, and range.
• Visualization methods: To visualize continuous data, you can use a
histogram or a box plot. With a histogram, you can check the central
tendency, variability, modality, and kurtosis of a distribution. Note that a
histogram may not clearly show whether you have outliers, which is why
box plots are also used.
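The summaries listed for continuous data are all available in Python's `statistics` module. The sketch below uses hypothetical delivery times and adds the common 1.5 × IQR rule that box plots use to flag outliers. That rule is a convention assumed here, not something the slides specify.

```python
import statistics

# Hypothetical delivery times in minutes (ratio data), with one extreme value.
times = [22, 25, 27, 30, 31, 33, 35, 38, 41, 95]

mean = statistics.mean(times)
median = statistics.median(times)
stdev = statistics.stdev(times)                # sample standard deviation
q1, q2, q3 = statistics.quantiles(times, n=4)  # quartiles
iqr = q3 - q1
value_range = max(times) - min(times)

# Box-plot convention: flag points more than 1.5 * IQR beyond the quartiles.
outliers = [t for t in times if t < q1 - 1.5 * iqr or t > q3 + 1.5 * iqr]

print(mean, median, round(stdev, 1), iqr, value_range, outliers)
```

Comparing the mean with the median is a quick way to see how strongly a single extreme value pulls the average.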
36. Summary
• You discovered the different data types that are used
throughout statistics.
• You learned the difference between discrete and continuous
data and what the nominal, ordinal, interval and ratio
measurement scales are.
• Furthermore, you now know which statistical measurements
you can use for which data type and which are the right
visualization methods.
• You also learned which methods can be used to transform
categorical variables into numeric variables.
• This enables you to carry out a large part of an exploratory
analysis on a given dataset.
38. BASIC DEFINITION
• Analytics: Analytics is the discovery, interpretation, and
communication of meaningful patterns or summaries in
data.
• Data Analytics (DA) is the process of examining data
sets in order to draw conclusions about the information they
contain.
• Analytics is not a tool or technology; rather, it is a way
of thinking and acting on data.
39. WHAT IS ANALYTICS?
Data on its own is useless unless you can make sense of it!
The scientific process of transforming data into insight for making better
decisions, offering new opportunities for a competitive advantage
www.imarticus.org 39
40. The Case for Business Analytics
BUSINESS NEED
• The business environment today is
more complex than ever before.
• Businesses are expected to be diligently
responsive to the increasing demands of
customers, various stakeholders and
even regulators.
SOLUTION
• Organizations have been turning to the
use of analytics.
• More than 83% of Global CIOs surveyed
by IBM in 2010 singled out Business
Intelligence and Analytics as one of their
visionary plans for enhancing
competitiveness.
GOAL
In most cases the primary objective of an
organization that seeks to turn to analytics
is:
• Revenue/Profit growth
• Optimize expenditure
41. WHAT IS DATA ANALYTICS?
• Data Analytics:
– “is a process of inspecting, cleansing, transforming, and modeling data
with the goal of discovering useful information, suggesting conclusions,
and supporting decision-making” – Wikipedia
– “leverage data in a particular functional process (or application) to
enable context-specific insight that is actionable” – Gartner
– “is using our current data sets to extract useful information to
support advanced decision making” – ATC
• Data Visualizations (i.e. Data Viz):
– Visual context of data, Dashboards
– Often single page, real-time user interface, graphical presentation of
your data
“Without data you’re just another person with an opinion.” ~W. Edwards Deming
43. DATA ANALYTICS ARCHITECTURE
[Figure: data analytics architecture. Stages shown: Data Layer (Data Collection, Integration), Preparation (Cleaning, Processing), Analysis, Visualization, Interpretation, Translation, Action. Side labels: Business and Information Technology Partnership and Stewardship; Data, Rules, Tools; Data, Information & Knowledge, Wisdom.]
“The greatest value of a picture is when it forces us to notice what we never expected to
see.” ~John Tukey
44. TYPES OF ANALYTICS
1. Descriptive Analytics: this analysis draws on historical data as well as
current data. It is generally used to answer questions such as
“What happened with ABC?”, “What happens if XYZ?”, and so on.
2. Diagnostic Analytics: this analysis is used to explain events based on
the related data. It is used to answer questions such as
“Why did ABC happen when XYZ?”, “What is wrong with strategy DEF?”,
and so on.
3. Predictive Analytics: this analysis tries to infer trends and future
events from the available historical data. It tends to be more complex
than the previous two types because it requires deeper modeling and
analysis.
4. Prescriptive Analytics: this analysis is used to optimize processes,
structures and systems using the information produced by Predictive
Analytics. In essence, it tells the business what needs to be done to
anticipate upcoming events.
45. DESCRIPTIVE ANALYTICS
Copyright (except where referenced) 2014-2016
Stephen H. Kaisler, Frank Armour, Alberto Espinosa and William H. Money
• Process:
– Identify the attributes, then assess/evaluate the attributes
– Estimate the magnitude to correlate the relative contribution of each attribute to the final
solution
– Accumulate more instances of data from the data sources
– If possible, perform the steps of evaluation, classification and categorization quickly
– Yield a measure of adaptability within the OODA loop
• At some threshold, crossover into diagnostic and predictive analytics
http://v1shal.com/content/25-cartoons-give-current-big-data-hype-perspective/
46. DIAGNOSTIC ANALYTICS
• Process:
– Begin with descriptive analytics
– Extract patterns from large data quantities via data mining
– Correlate data types for explanation of near-term behavior – past and present
– Estimate linear/non-linear behavior not easily identifiable through other
approaches.
• Example: by classifying past insurance claims, estimate the number of
future claims to flag for investigation with a high probability of being
fraudulent.
47. PREDICTIVE ANALYTICS
• Process:
– Begin with descriptive AND diagnostic analytics
– Choose the right data based on domain knowledge and relationships among
variables
– Choose the right techniques to yield insight into possible outcomes
– Determine the likelihood of possible outcomes given initial boundary conditions
– Remember! Data driven analytics is non-linear; do NOT treat like an engineering
project
48. PRESCRIPTIVE ANALYTICS
• Process:
– Begin with predictive analytics
– Determine what should occur and how to make it so
– Determine the mitigating factors that lead to desirable/undesirable outcomes
– “What-if” analysis with local or global optimization
– Ex: Find the best set of prices and advertising frequency to maximize revenue
– Ex: And, the right set of business moves to make to achieve that goal
“Make it so”
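The pricing example on this slide can be sketched as an exhaustive “what-if” search over candidate decisions. The demand model, its coefficients, and the candidate ranges below are all hypothetical and serve only to show the shape of the optimization step. In practice the revenue function would come from a predictive model fitted to data.

```python
import math
from itertools import product

def expected_revenue(price, ads_per_week):
    """Toy revenue model (an assumption for illustration only)."""
    base_demand = max(0.0, 1000 - 50 * price)       # demand falls with price
    uplift = 1 + 0.2 * math.sqrt(ads_per_week)      # diminishing ad returns
    ad_cost = 200 * ads_per_week
    return price * base_demand * uplift - ad_cost

prices = range(5, 21)      # candidate prices
ad_levels = range(0, 11)   # candidate ads per week

# "What-if" analysis: evaluate every alternative and keep the best one.
best_price, best_ads = max(product(prices, ad_levels),
                           key=lambda choice: expected_revenue(*choice))

print(best_price, best_ads, round(expected_revenue(best_price, best_ads), 2))
```

A grid search like this finds the global optimum only over the candidates it is given. Larger decision spaces typically call for mathematical programming rather than enumeration.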
49. DECISIVE ANALYTICS
• Process:
– Given a set of decision alternatives, choose the one
course of action to take from possibly many
– But, it may not be the optimal one.
– Visualize alternatives – whole or partial subset
– Perform exploratory analysis – what-if and why
  – How do I get to there from here?
  – How did I get here from there?
54. Descriptive Analytics (Analytics deskriptif)
• Descriptive analytics is the data analytics process of obtaining a
general picture of the data that has been collected.
• An example of descriptive analytics is Google Analytics. In
Google Analytics we can only see simple information such as
the number of visitors per unit of time, which pages are
visited most often, and similar data.
• For this kind of analytics, simple operations such as sums and
averages, without machine learning, are more than enough.
• Descriptive analytics does not predict which page a visitor will
visit next, or explain why a visitor visits a particular
page.
• This type of data analytics is the most commonly encountered. Although
it is only simple data without machine learning processing, such
data is very much needed, especially for
benchmarking, to see the effect of the changes we
make.
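A minimal sketch of the kind of summary described above: visitors per day and most-visited pages, using only counts and an average. The page-view log is hypothetical and is not meant to reflect how Google Analytics stores or exposes its data.

```python
from collections import Counter

# Hypothetical page-view log: (date, page) pairs.
views = [
    ("2019-04-24", "/home"), ("2019-04-24", "/pricing"),
    ("2019-04-24", "/home"), ("2019-04-25", "/home"),
    ("2019-04-25", "/contact"), ("2019-04-26", "/home"),
    ("2019-04-26", "/pricing"), ("2019-04-26", "/pricing"),
]

views_per_day = Counter(date for date, _ in views)
views_per_page = Counter(page for _, page in views)
average_daily_views = sum(views_per_day.values()) / len(views_per_day)

print(dict(views_per_day))
print(views_per_page.most_common(2))
print(round(average_daily_views, 2))
```

Re-running the same summary before and after a site change gives the simple benchmark the slide refers to.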
55. Descriptive Models (Model Deskriptif)
• Descriptive models quantify relationships in data in a way that is
often used to classify customers or prospects
into groups.
• Descriptive models do not rank-order customers by their likelihood
of taking a particular action, the way predictive models do.
• Instead, descriptive models can be used, for example, to
categorize customers by their product preferences and
life stage.
• Descriptive modeling tools can be used to
develop further models that can simulate
large numbers of individual agents and make predictions.
60. Descriptive Analytics
What has occurred?
• Descriptive analytics, such as data
visualization, is important in helping users
interpret the output from predictive and
prescriptive analytics.
• Descriptive analytics, such as reporting/OLAP,
dashboards, and data visualization, have been
widely used for some time.
• They are the core of traditional BI.
65. Predictive Analytics
What will occur?
• Marketing is the target for many predictive analytics
applications.
• Descriptive analytics, such as data visualization, is important
in helping users interpret the output from predictive and
prescriptive analytics.
• Algorithms for predictive analytics, such as regression analysis,
machine learning, and neural networks, have also been around
for some time.
• Prescriptive analytics are often referred to as advanced analytics.
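Regression analysis, the first algorithm named above, can be sketched with ordinary least squares on a single predictor. The monthly sales figures are hypothetical and deliberately perfectly linear so that the fitted line is easy to verify by hand.

```python
# Hypothetical monthly sales; fit sales = intercept + slope * month.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Ordinary least squares estimates for a single predictor.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

next_month = 7
forecast = intercept + slope * next_month  # "What will occur?"

print(slope, intercept, forecast)
```

Real data would not fall exactly on a line, so a forecast like this is normally reported with some measure of uncertainty.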
70. Prescriptive Analytics
What should occur?
• For example, the use of mathematical programming for revenue management is
common for organizations that have “perishable” goods (e.g., rental cars, hotel
rooms, airline seats).
• Harrah’s has been using revenue management for hotel room pricing for some
time.
• Prescriptive analytics are often referred to as advanced analytics.
• Regression analysis, machine learning, and neural networks
• Often for the allocation of scarce resources
71. Know Your Tools &
Why Learn About
Them?
DATA ANALYTICS TOOLS
73. TOOLS COVERED IN PROGRAM
The program is developed keeping in mind the needs of an evolving Analytics industry that
requires individuals to be “job-ready” from Day 1.
74. Why SAS?
• The largest independent vendor in the business intelligence market
• The de facto industry standard for Clinical Data Analysis
• #1 Market Leader in Analytics
• Used in 60,000+ companies in over 135 countries
• “Analytics powerhouse”
INTEGRATED PLATFORM FOR END TO END SOLUTIONS:
SAS provides an integrated set of software products and services
and integrated technologies for information management,
advanced analytics and reporting.
BUSINESS SOLUTIONS ACROSS DOMAINS AND INDUSTRIES:
Unmatched domain specific industry focused analytics solutions
The Forrester Wave™: Big Data Predictive Analytics Solutions, Q1 2013
75. Why R?
• R is the #1 Google Search for Advanced Analytics software (Google Trends, April 2016)
• Highest Paid IT Skill (LinkedIn Skills and O'Reilly Survey, 2016)
• Most-used data science language after SQL (O’Reilly Survey, Jan 2014)
• 75% of data professionals use R (Rexer Survey, Oct 2015)
• Second best programming language for data science (O'Reilly Survey, 2016)
• Supports close to 10,000 free packages (CRAN figure as of December 2016)
• R is #13 of all Programming Languages (RedMonk Language Ratings, June 2015)
Demand for R language skills is on the rise.
Companies Already Onboard R:
Facebook, Google, Twitter, McKinsey, ANZ Bank, BCG, Uber, Lloyds of London & Many More…
76. What is Hadoop?
Hadoop is Transforming Businesses Across Industries
“The growing use of Apache Hadoop, increasing data warehouse volume sizes and the
accumulation of legacy systems in organizations are fostering structured data growth. These
factors are leading enterprises to understand how to reuse, repurpose and gain critical insight
from this data.” Gartner
BIG DATA STORING AND FASTER PROCESSING
Hadoop is an open source software framework, created in 2005,
that stores and processes big data in a distributed manner on
a large collection of hardware.
1 in 4 organizations use Hadoop to
manage their data today
(up from 1 out of 10 in 2012)
BUSINESS SOLUTIONS ACROSS DOMAINS AND INDUSTRIES:
Low cost solution with a high fault tolerance to access and create
value from data.
77. WHY HADOOP?
Hadoop – Big Data is a comprehensive classroom training program that enables you to analyse data and
create useful information for careers in Data Analytics.
Top 5 Reasons Organizations are using Hadoop:
• Scalability
• Computing Power
• Low Cost Storage
• Flexibility
• Data Protection
Top 5 Industries using Hadoop:
• Computer Manufacturing
• Business Services
• Finance
• Retail & Wholesale
• Education & Government
Enterprises using Hadoop
79. WHY PYTHON?
Python is a powerful, flexible, open-source language that is easy to learn, easy to use, and has
powerful libraries for data manipulation and analysis.
• Cost of Ownership: Python is open source software
that is free to download.
• Versatility: Multi-purpose language that can
be used to build an entire
application.
• Integration: In industry, the data science
trend shows increasing
popularity of Python. A
Python-based application
stack can more easily
integrate a data scientist who
writes Python code, since that
eliminates a key hurdle in
productionizing a data
scientist's work.
• Big data compatibility: Python has become one of the big go-to languages for big data processing due to its wide
selection of libraries.
What are the reasons for its sudden popularity?
A Data Scientist’s Dream
Python is particularly useful in data analytics because
it has a rich library for reading and writing data,
running calculations on the information and creating
graphical representations of data sets.
We can write MapReduce programs in Python using
Pydoop. Here is where Python scores over R. While R
uses in-memory processing, Python using Pydoop
can process petabytes of data.
• Python offers extensive
analytics capabilities for
Text & Predictive Analytics.
• IDLE and the Spyder IDE are
widely used for data mining.
• Big Data Analytics made
possible by Pydoop and
SciPy
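The MapReduce programming model mentioned on this slide can be illustrated in plain Python. The sketch below does not use Pydoop or Hadoop; it runs in a single process and only shows the map, shuffle, and reduce steps that such frameworks distribute across a cluster. The function names and the two input lines are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data needs big tools", "data beats opinion"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(key, values)
              for key, values in shuffle(mapped).items())

print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be run in parallel on separate machines, which is what makes the model scale.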
80. WHY PYTHON?
• Official language of Google
• Among top in-demand data science skills (KDnuggets, Dec 2014)
• 46% of job ads mention Python (after SQL) (KDnuggets, Dec 2014)
• Ranked #1 of all programming languages (CodeEval rankings, Feb 2015)
• 2nd most popular data science language (KDnuggets, 2013)
Companies Already Onboard Python:
Google, Yahoo, Quora, Nokia, ABN AMRO Bank, IBM, National Weather Service & Many More…
81. WHAT IS DATA VISUALIZATION?
Data visualization is the presentation of data in a pictorial or graphical format. For centuries, people
have depended on visual representations such as charts and maps to understand information more
easily and quickly.
82. WHY TABLEAU FOR DATA VISUALIZATION?
Tableau is a powerful, flexible Data Visualization tool that is easy to learn, easy to use, and has
powerful libraries for data visualization and presentation.
• Cost of Ownership: Tableau is competitively priced
software that is available for a trial
download.
• Versatility: Multi-purpose package that can be
used to build an entire application.
• Big data compatibility: Tableau has become one of the big go-to software programs for Data visualization due to the
wide variety of tools it provides and compatibility with Big Data platforms such as Hadoop.
83. WHY TABLEAU FOR DATA VISUALIZATION?
A BUSINESS ANALYST’S DREAM
Tableau is easy to learn, easy to use, and significantly faster
than existing solutions. One can easily see patterns,
identify trends and discover visual insights in
seconds. No wizards, no scripts.
Tableau facilitates live, up-to-date data analysis that
taps into the power of the firm’s data warehouse.
Extract data into Tableau’s data engine and take
advantage of breakthrough in-memory architecture.
• Tableau offers powerful
visualization capabilities,
without a single line of code.
• Experiment with trend
analyses, regressions,
correlations.
• Scalable, secure and
reliable cloud and mobile
connectivity.
INTEGRATION
• Tableau integrates
exceptionally well with R and
Hadoop, making it a powerful
visualization tool for analytics
and big data use cases.
• Developers creating web
applications can integrate fully
interactive Tableau content
into their applications via the
JavaScript API.