2. Outlines
1. Introduction to Data Science
2. Phenomena and Definition of Big Data
3. Platforms, Technologies, Tools, and Methods in Big Data Analysis
4. Implementations and Research in Big Data
3. Introduction to Data Science
• Data science is the study that focuses on knowledge extraction from
data: data collection, preparation, analysis, visualization,
management, recommendation, etc.
• Data science is an interdisciplinary field that requires hacking skills
(i.e., programming), math and statistics knowledge, and substantive
expertise in a field of science.
4. Processes in Data Science
1. Objectives: ask the right questions
to find what the problem is.
2. Data collection: get relevant data for
analysis of the problem.
3. Data preprocessing: explore the data
to make error corrections (cleaning
and organizing).
4. Computational and data model:
descriptive, predictive, etc.
5. Reporting/Dissemination/Publication.
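The steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the sample records, the function names, and the choice of a least-squares line as the model are all hypothetical and are not taken from the slides.

```python
from statistics import mean

# Hypothetical raw records: (hours_studied, exam_score); None marks a missing value.
RAW = [(1.0, 52.0), (2.0, 58.0), (None, 61.0), (3.0, 65.0), (4.0, None), (5.0, 78.0)]

def collect():
    """Step 2: data collection (here, just return the in-memory sample)."""
    return list(RAW)

def preprocess(rows):
    """Step 3: cleaning, drop records with missing fields."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def fit_line(rows):
    """Step 4: a predictive model, ordinary least squares for y = a + b*x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def report(a, b):
    """Step 5: reporting, a one-line summary of the fitted model."""
    return f"score = {a:.2f} + {b:.2f} * hours"

# Step 1 (objective): how does study time relate to exam score?
clean = preprocess(collect())
intercept, slope = fit_line(clean)
summary = report(intercept, slope)
print(summary)
```

Each function maps to one step, so a real project could swap in a database query for `collect` or a different model for `fit_line` without touching the rest.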
Data Science: Software and
Implementations|4
5. Final Goals in Data Analysis
1. Decision analytics: supports decision-making with visual analytics
that reflect reasoning.
2. Descriptive analytics: provides insight from historical data with
reporting, score cards, clustering, etc.
3. Predictive analytics: employs predictive modeling using statistical
and machine learning techniques.
4. Prescriptive analytics: recommends decisions using optimization,
simulation, etc.
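As a rough illustration of how three of these goals differ, the sketch below runs a descriptive, a predictive, and a prescriptive step on the same toy data. The demand figures, the naive trend model, and the cost parameters are hypothetical choices made only for this example.

```python
from statistics import mean

# Hypothetical historical demand per week (units sold).
history = [10, 12, 14, 16]

# Descriptive analytics: summarize what already happened.
average_demand = mean(history)

# Predictive analytics: extrapolate the average week-over-week change
# (a deliberately naive trend model).
steps = [b - a for a, b in zip(history, history[1:])]
forecast = history[-1] + mean(steps)

# Prescriptive analytics: pick the stock level with the lowest cost,
# given hypothetical per-unit costs for overstock and for lost sales.
def cost(stock, demand, over=1.0, under=3.0):
    return over * max(stock - demand, 0) + under * max(demand - stock, 0)

best_stock = min(range(0, 31), key=lambda s: cost(s, forecast))

print(average_demand, forecast, best_stock)
```

The descriptive step looks backward, the predictive step estimates the next value, and the prescriptive step turns that estimate into a recommended action.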
6. Phenomena of Big Data
Volume of digital data, 2010 to 2025 (in zettabytes; 1 zettabyte = 10^21 bytes).
11. What is Big Data?
1. Volume: The huge amounts of data being
stored.
2. Velocity: The lightning speed at which data
streams must be processed and analyzed.
3. Variety: The different sources and
forms from which data is collected, such as
numbers, text, video, images, and audio.
18. Issues in Big Data Technologies:
1. Computational models: how the data are
processed and analyzed (data analysis/data
science).
2. Database/storage frameworks: technologies and
mechanisms to write, read, and manage Big Data
efficiently. Fault tolerance, availability, consistency,
scalability, and heterogeneity of Big Data should
be considered as well.
21. Big Data Platforms
• Redundant and reliable: platforms replicate data automatically,
so when a machine goes down there is no data loss.
• Runs on commodity hardware: Don’t have to buy special hardware,
expensive RAIDs, or redundant hardware; reliability is built into
software.
• Scale out rather than scale up.
• Bring code to data rather than data to code.
• Fault tolerant/Deal with failures.
• Break disk read barrier.
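Two of the ideas above, replication for fault tolerance and bringing code to data, can be sketched in plain Python. This is not how any particular platform is implemented: the partition layout, node names, replication factor of 2, and the `run_job` helper are all hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical cluster: each partition of the data is stored on two nodes
# (replication factor 2, chosen only for illustration).
partitions = {
    "p0": ["big data big", "data science"],
    "p1": ["big cluster", "data data"],
}
replicas = {"p0": ["node1", "node2"], "p1": ["node2", "node3"]}

def count_words(lines):
    """The 'code' shipped to the data: runs where the partition lives."""
    return Counter(word for line in lines for word in line.split())

def run_job(failed_nodes=()):
    partials = []
    for pid, lines in partitions.items():
        # Fault tolerance: use any replica whose node is still alive.
        alive = [n for n in replicas[pid] if n not in failed_nodes]
        if not alive:
            raise RuntimeError(f"all replicas of {pid} are lost")
        partials.append(count_words(lines))
    # Only the small per-partition counts are combined centrally.
    return reduce(lambda a, b: a + b, partials, Counter())

totals = run_job(failed_nodes={"node1"})
print(totals)
```

With one node down the job still completes because every partition has a surviving replica; it fails only when all replicas of some partition are lost.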
24. • In April 2008, Hadoop broke a world record to become the
fastest system to sort an entire terabyte of data. Running on
a 910-node cluster, Hadoop sorted 1 terabyte in 209
seconds (just under 3.5 minutes), beating the previous year’s
winning time of 297 seconds.
• In November of the same year, Google reported that its
MapReduce implementation sorted 1 terabyte in 68
seconds.
• Then, in April 2009, it was announced that a team at Yahoo!
had used Hadoop to sort 1 terabyte in 62 seconds.
• In the 2014 competition, a team from Databricks was a joint
winner of the Gray Sort benchmark. They used a 207-node
Spark cluster to sort 100 terabytes of data in 1,406 seconds,
a rate of 4.27 terabytes per minute.
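To compare these results on a common scale, the figures quoted above can be converted to terabytes per minute. The snippet uses only the numbers from the text; the labels and the helper name are illustrative.

```python
# (label, terabytes sorted, seconds) as quoted in the text above.
results = [
    ("Hadoop, April 2008", 1, 209),
    ("Google MapReduce, November 2008", 1, 68),
    ("Yahoo! Hadoop, April 2009", 1, 62),
    ("Databricks Spark, 2014", 100, 1406),
]

def tb_per_minute(terabytes, seconds):
    """Throughput in terabytes per minute."""
    return terabytes / (seconds / 60)

rates = {label: tb_per_minute(tb, s) for label, tb, s in results}
for label, rate in rates.items():
    print(f"{label}: {rate:.2f} TB/min")
```

Note that the runs used different cluster sizes and data volumes, so the rates are not a like-for-like comparison of the frameworks.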
27. Hadoop Distributed File Systems (HDFS)
• HDFS is a filesystem designed for storing very large files
with streaming data access patterns, running on clusters
of commodity hardware.
• Very large files: hundreds of megabytes, gigabytes, or terabytes
in size.
• Streaming data access: a write-once, read-many-times pattern.
• Commodity hardware: runs on clusters of ordinary, inexpensive machines.
• HDFS is not a good fit for:
• Low-latency data access
• Lots of small files
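One way to see why many small files are a poor fit is to estimate the metadata they generate. The sketch below assumes a 128 MiB block size and roughly 150 bytes of NameNode memory per file or block; both numbers are commonly quoted rules of thumb, not figures from these slides, and real clusters may be configured differently.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed block size (128 MiB)
BYTES_PER_OBJECT = 150       # assumed rule-of-thumb metadata cost per file or block

def blocks_for(file_size):
    """Number of blocks a file of the given size (in bytes) occupies."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

def namenode_bytes(file_sizes):
    """Estimated metadata: one object per file plus one per block."""
    return sum((1 + blocks_for(size)) * BYTES_PER_OBJECT for size in file_sizes)

one_large = [1024**3]               # a single 1 GiB file
many_small = [1024] * (1024**2)     # the same 1 GiB stored as 1 KiB files

print(namenode_bytes(one_large), namenode_bytes(many_small))
```

Under these assumptions the same amount of data costs several orders of magnitude more metadata when it is split into small files.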
29. Implementations of Big Data Analysis
• Google: uses Big Data for search,
recommendation, etc.
• Amazon: collects Big Data on
customers’ behavior for its recommendation
system.
• Facebook: uses Big Data analysis for image
recognition when tagging, deepfakes, People
You May Know, etc.
30. Papers Related to Big Data
1. Riza, L. S., Pratama, F. D., Piantari, E., & Fashi, M. (2020). Genomic
repeats detection using Boyer-Moore algorithm on Apache Spark
Streaming. Telkomnika, 18(2), 783-791.
2. Baig, M. I., Shuib, L., & Yadegaridehkordi, E. (2020). Big data in education:
a state of the art, limitations, and future research directions.
International Journal of Educational Technology in Higher Education,
17(1), 1-23.
3. Mayabee, T. T., Khan, S., Alam, A., Amin, S., Chowdhury, J. K., Hassan, M.
T., ... & Hasan, M. (2022). Student Performance Monitor: A Big Data
Analytical Application. In Proceedings of International Conference on
Data Science and Applications (pp. 759-771). Springer, Singapore.
31. Big Data in Bioinformatics
Riza, L. S., Pratama, F. D., Piantari, E., & Fashi, M. (2020). Genomic repeats
detection using Boyer-Moore algorithm on Apache Spark Streaming. Telkomnika,
18(2), 783-791.
32. Genomic repeats detection using Boyer-
Moore algorithm on Apache Spark Streaming
• Repeat identification and
classification are important,
fundamental annotation tasks,
both because of their role in the
evolution of genomes and diseases
and because repeats need to be
distinguished from other gene types.
• Detecting genomic repeats is
essentially a string-matching
(pattern-matching) task: searching
for a pattern in a large text.
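For readers unfamiliar with the algorithm named in the title, here is a minimal sketch of Boyer-Moore string search. It is a simplified variant that uses only the bad-character rule (the full algorithm also has a good-suffix rule), and it is not the authors' implementation; the function names and the sample sequence are illustrative.

```python
def bad_character_table(pattern):
    """Map each character to the index of its last occurrence in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}

def boyer_moore_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    matches = []
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)
            s += 1  # conservative shift so overlapping matches are kept
        else:
            # Align the mismatched text character with its last occurrence
            # in the pattern, or move past it if it does not occur.
            s += max(1, j - last.get(text[s + j], -1))
    return matches

hits = boyer_moore_search("GATTACAGATTACA", "ATTA")
print(hits)
```

The right-to-left comparison lets the search skip several positions at once when the mismatched character does not appear in the pattern, which is what makes the approach attractive on long texts.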
33. Research Objective
• This research aims to build a big-data computational model
and implement the Boyer-Moore algorithm to find string
patterns in human chromosome genome data available on the Ensembl
pages.
• Apache Spark is an open-source cluster computing framework for
large data processing.
34. Research Method in
Genomic Repeats
• 4 working environments:
• On personal computers
• On virtual machines in a Google Cloud
project
• On HDFS
• With Apache Spark Streaming
• Data collection (around 3.9 GB):
Human DNA sequences, which can
be downloaded freely from
ftp://ftp.Ensembl.Org/pub/release-95/fasta/homo_sapiens/dna/.
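When a sequence this large is processed in pieces, for example as a stream, a pattern can straddle the boundary between two pieces. The sketch below shows one common way to handle that in plain Python by carrying over the last len(pattern) - 1 characters. It does not use the Spark Streaming API and is not the paper's implementation; `find_all` uses `str.find` as a stand-in matcher, and any exact matcher such as Boyer-Moore could be substituted.

```python
def find_all(text, pattern):
    """Start indices of all (possibly overlapping) occurrences, using str.find."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits

def search_in_chunks(chunks, pattern):
    """Search a sequence that arrives as consecutive chunks (e.g. a stream)."""
    if not pattern:
        return []
    overlap = len(pattern) - 1
    tail = ""     # last `overlap` characters of the data seen so far
    offset = 0    # global position of the first character of `tail`
    matches = []
    for chunk in chunks:
        window = tail + chunk
        # Every match in the window touches the new chunk, because the tail
        # alone is shorter than the pattern, so nothing is counted twice.
        matches.extend(offset + i for i in find_all(window, pattern))
        keep = min(overlap, len(window))
        offset += len(window) - keep
        tail = window[len(window) - keep:]
    return matches

chunks = ["GATT", "ACAGAT", "TACA"]   # hypothetical pieces of one sequence
found = search_in_chunks(chunks, "ATTA")
print(found)
```

Both occurrences in this example cross a chunk boundary, so a search that treated each chunk independently would miss them.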
36. Big Data in Education
Baig, M. I., Shuib, L., & Yadegaridehkordi, E. (2020). Big data in education: a state of
the art, limitations, and future research directions. International Journal of
Educational Technology in Higher Education, 17(1), 1-23.
37. Big data in education
• In the educational realm, a large volume of data is produced through
online courses, teaching and learning activities.
• Academic data can help teachers analyze their teaching pedagogy
and effect changes according to students’ needs and requirements.
• The large-scale administrative data can play a tremendous role in
managing various educational problems.
• Therefore, it is essential for professionals to understand the
effectiveness of big data in education in order to minimize
educational issues.
40. Student Performance Monitor:
A Big Data Analytical Application
Mayabee, T. T., Khan, S., Alam, A., Amin, S., Chowdhury, J. K., Hassan, M. T., ... &
Hasan, M. (2022). Student Performance Monitor: A Big Data Analytical Application.
In Proceedings of International Conference on Data Science and Applications (pp.
759-771). Springer, Singapore.
41. Objectives
• To analyze Program Learning Outcomes (PLO) in Outcome-Based
Education (OBE) by using Big Data Analytics.
• The outcome-based education (OBE) system
is an educational theory in which every part of
the curriculum is centered on outcomes
or goals that a student must accomplish to
successfully complete their program.
44. Other Example: Data Analysis in Education
[Figure: data from the real world (sensors 1 to k, teacher, student) yields non-text data and text data. Joint mining of non-text and text data provides multiple predictors (features) for a predictive model, whose predicted values of real-world variables are used to change the world.]
45. Big Data for Education
[Figure: scalability versus quality. MOOCs and small classrooms are contrasted; “Big Data Technology” leads toward a scalable, intelligent MOOC.]
• Automate grading with machine learning
• Automate question answering on forums
47. References
• Baig, M. I., Shuib, L., & Yadegaridehkordi, E. (2020). Big data in education: a state of the art,
limitations, and future research directions. International Journal of Educational Technology in
Higher Education, 17(1), 1-23.
• Big Data Education System Leaderboard, University of Illinois at Urbana-Champaign, The Data and
Information Systems Laboratories,
http://times.cs.uiuc.edu/czhai/pub/bigdata-education-zhai.pptx
• Favaretto, M., De Clercq, E., Schneble, C. O., & Elger, B. S. (2020). What is your definition of Big
Data? Researchers’ understanding of the phenomenon of the decade. PLoS ONE, 15(2), e0228987.
• Mayabee, T. T., Khan, S., Alam, A., Amin, S., Chowdhury, J. K., Hassan, M. T., ... & Hasan, M. (2022).
Student Performance Monitor: A Big Data Analytical Application. In Proceedings of International
Conference on Data Science and Applications (pp. 759-771). Springer, Singapore.
• Riza, L. S., Pratama, F. D., Piantari, E., & Fashi, M. (2020). Genomic repeats detection using Boyer-
Moore algorithm on Apache Spark Streaming. Telkomnika, 18(2), 783-791.