BIG DATA ANALYTICS
QUICK RESEARCH GUIDE
BY ARTHUR MORGAN
with
Art’s Talking Points™
FAIR USE NOTICE
This Quick Research Guide is for non-commercial educational and informational purposes only.
This presentation may contain copyrighted material owned by a third party, the use of which has not been specifically
authorized by the copyright owner. Notwithstanding a copyright owner's exclusive rights, Section 107 of the Copyright Act
of 1976 allows limited use of copyrighted material without requiring permission from the rights holders, for
purposes such as education, criticism, comment, news reporting, teaching, scholarship, and research. These so-called "fair
uses" are permitted even if the use of the work would otherwise be infringing.
If you wish to use copyrighted material published in this presentation for your own purposes that go beyond fair use, you
must obtain permission from the copyright owner. It is recommended that you seek the advice of legal counsel if you have
any questions on this point.
If you believe that any content in this presentation violates your intellectual property or other rights, please notify Arthur
Morgan by email to art_morgan@att.net.
IMAGE SOURCE: jean-guichard.com
PROLOGUE: HOW BIG IS BIG DATA?
In 2024, the World Wide Web contained roughly 150 billion terabytes of data.
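To put that figure in the same units as the rest of this guide, the short sketch below converts it using decimal (SI) prefixes. The variable names are illustrative only; the one input is the 150 billion terabytes quoted above.

```python
# Decimal (SI) prefixes: 1 terabyte = 10**12 bytes, 1 zettabyte = 10**21 bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21

# "150 billion terabytes" is the figure quoted on the slide.
web_2024_terabytes = 150 * 10**9
web_2024_bytes = web_2024_terabytes * TERABYTE
web_2024_zettabytes = web_2024_bytes // ZETTABYTE

print(f"{web_2024_bytes:.1e} bytes")        # 1.5e+23 bytes
print(f"{web_2024_zettabytes} zettabytes")  # 150 zettabytes
```

In other words, 150 billion terabytes is the same quantity as 150 zettabytes.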
IMAGE SOURCE: reddit.com
DATA ANALYTICS
o Data analytics is numerical detective work.
o Basic data analytical tools:
    o Web Browser
    o Microsoft Excel
    o Linear Algebra
    o Artificial Intelligence
o What story is the data telling us?
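One simple way to do this kind of detective work is a three-step routine: tally, sort, plot. The sketch below walks through it with only the Python standard library. The observations and the `bar_chart` helper are made up for illustration, and a text bar chart stands in for a real plotting tool.

```python
from collections import Counter

# Hypothetical observations; any list of category labels would do.
observations = ["csv", "json", "csv", "parquet", "csv", "json"]

# 1. Count/tally: how often does each label occur?
tally = Counter(observations)

# 2. Organize/sort: most frequent first.
ranked = tally.most_common()

# 3. Plot: a text bar chart, one '#' per observation.
def bar_chart(pairs):
    width = max(len(label) for label, _ in pairs)
    return [f"{label:<{width}} | {'#' * count} {count}" for label, count in pairs]

for line in bar_chart(ranked):
    print(line)
```

Even at this size, the sorted bars make the dominant category visible at a glance, which is the point of plotting before interpreting.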
ARTIFICIAL INTELLIGENCE
IMAGE SOURCE: forbes.com
DATA ANALYTICS CAN START A MOVEMENT
IMAGE SOURCE: reddit.com
CONTENTS
Introduction
o Billions of Terabytes
o The Zettabyte Era
o Data File Formats & Data Plots
o Data Science
    o Computer Science
    o Math & Statistics
    o Domain Knowledge
o Apache Hadoop
    o Hadoop Distributed File System
    o MapReduce
o Big Data Analytics Milestones
Research Resources
o John Tukey
o Edward Tufte
o John Mashey
o Sanjay Ghemawat
Conclusion
o What are Big Data Analytics Good for?
o Key Takeaways
o Book Recommendations
IMAGE SOURCE: economist.com
BILLIONS OF TERABYTES: THE ZETTABYTE ERA
o It has been almost two decades since Big Data became a thing.
o Given the sheer size of the World Wide Web, it continues to be the mother lode of data to be mined.
o Big Data is still the new oil.
IMAGE SOURCE: medium.com
DATA FILE FORMATS
o JSON: JavaScript Object Notation
o CSV: Comma-Separated Values
o Apache Parquet
o Apache Avro
o Apache ORC: Optimized Row Columnar
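A practical difference between the two simplest formats is what happens to data types. CSV is plain text with no type information, so numbers and booleans come back as strings, while JSON keeps its basic types. The sketch below round-trips one record through each format using only the standard library. The record and its field names are hypothetical.

```python
import csv
import io
import json

# Hypothetical record with a string, an integer, and a boolean field.
record = {"name": "sensor-1", "reading": 42, "valid": True}

# Round trip through CSV: every value comes back as text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# Round trip through JSON: numbers and booleans keep their types.
from_json = json.loads(json.dumps(record))

print(from_csv)   # values are all strings
print(from_json)  # values match the original record
```

Parquet, Avro and ORC go a step further by storing an explicit schema with the data. They need third-party libraries, so they are left out of this sketch.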
DATA PLOTS
IMAGE SOURCE: kuan-liu.com
IMAGE SOURCE: r-charts.com
Scatter Plot & Box Plot
Nightingale Rose Chart
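A box plot is drawn from a handful of summary numbers: the median and the two quartiles around it. The sketch below computes them with the standard library for a made-up sample. The 1.5 × IQR fence used to flag outliers is a common convention assumed here, not something taken from the slides.

```python
import statistics

# Hypothetical sample; the last value is deliberately far from the rest.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles: the edges of the box (q1, q3) and the line inside it (median).
q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")

# Interquartile range: the height of the box.
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in sample if x < low_fence or x > high_fence]

print(q1, median, q3)  # 4.0 6.0 8.0
print(outliers)        # [30]
```

A scatter plot would show the same stray point directly, while the box plot summarizes where the bulk of the data sits relative to the median.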
IMAGE SOURCE: researchgate.net
DATA SCIENCE
o Computer Science
o Math & Statistics
o Domain Knowledge
o Example Data Science Project
IMAGE SOURCE: amazon.com
APACHE HADOOP
o Hadoop Distributed File System: Provides efficient, reliable access to large datasets.
o MapReduce: Programming model for generating and processing large datasets.
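The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in a single Python process and does not use Hadoop's API. The function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are hypothetical, chosen only to mirror the stages of the model.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input: each string stands in for one file in a large dataset.
documents = ["big data is big", "data tells a story"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, so on a real cluster the work can be spread across many machines, with the file system supplying the documents.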
IMAGE SOURCE: mailjet.com
BIG DATA ANALYTICS MILESTONES
1960s
o The Future of Data Analysis (1962)
1980s
o World Wide Web (1989)
1990s
o Big Data (1998)
2000s
o Google File System (2003)
o Google MapReduce (2004)
2010s
o The Zettabyte Era (2016)
IMAGE SOURCE: amazon.com
JOHN TUKEY
o The Future of Data Analysis
IMAGE SOURCE: amazon.com
EDWARD TUFTE
o The Visual Display of Quantitative Information
IMAGE SOURCE: usenix.org
JOHN MASHEY
o Big Data … and the Next Wave of InfraStress
o On the Origin(s) and Development of “Big Data”: The Phenomenon, the Term, and the Discipline
IMAGE SOURCE: facesofopensource.com
SANJAY GHEMAWAT
o The Google File System
o MapReduce: Simplified Data Processing on Large Clusters
IMAGE SOURCE: netsuite.com
WHAT ARE BIG DATA ANALYTICS GOOD FOR?
o Insight Discovery
o Decision Making
o Pattern Recognition
o Historical Context
o Future Predictions
IMAGE SOURCE: tcgdigital.com
KEY TAKEAWAYS
o All data has a story to tell.
o Basic data analytical tools, such as Microsoft Excel, are straightforward and ubiquitous.
o The World Wide Web contains the mother lode of data.
IMAGE SOURCE: amazon.com
IMAGE SOURCE: amazon.com
BOOK RECOMMENDATIONS
AVAILABLE AT OTHER SJPL BRANCHES (CALL NO. 005.7565 GRUS):
AVAILABLE AT OTHER SJPL BRANCHES (CALL NO. 005.74 WHITE):
IMAGE SOURCE: jean-guichard.com
EPILOGUE: BIG DATA IS GETTING BIGGER
By 2030, the World Wide Web is projected to contain over 600 zettabytes of data.
Thanks!

Editor's Notes

  • #1
    - This is a Quick Research Guide (QRG).
    - QRGs include the following:
      - A brief, high-level overview of the QRG topic.
      - A milestone timeline for the QRG topic.
      - Links to various free online resource materials to provide a deeper dive into the QRG topic.
      - Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
    - QRGs planned for the series:
      - Artificial Intelligence QRG
      - Quantum Computing QRG
      - Big Data Analytics QRG
      - Spacecraft Guidance, Navigation & Control QRG (coming 2026)
      - UK Home Computing & The Birth of ARM QRG (coming 2027)
    - Any questions or comments?
      - Please contact Arthur Morgan at art_morgan@att.net.
    - 100% human made.
  • #2 - This QRG is for non-commercial educational and informational purposes only.
  • #3
    - Let’s be clear. I am not a statistician. I do not celebrate Pi Day. I am not a data scientist. I am a computer scientist.
    - There are roughly 200 billion trillion stars in the universe.
    - The World Wide Web contains roughly 150 billion trillion bytes of data.
    - For this discussion we will quantify Big Data in terms of the World Wide Web.
    - Yes, there are plenty of private databases and intranets that are a part of Big Data.
    - We’re going to focus on the 600-pound gorilla.
  • #4
    - For fans of Severance, the work is mysterious and important.
    - All data are numbers. All data tells a story.
    - Visualizing data is key to revealing the story.
    - The most basic analytical tool is pencil & paper.
    - We do not guarantee to introduce you to the best tools, particularly since we are not sure that there can be unique bests.
    - Simple end-to-end example:
      - Count/Tally (always positive).
      - Organize/Categorize/Sort.
      - Plot.
  • #5 - AI is a powerful tool for data analytics.
  • #6 - A good example of storytelling is Al Gore’s data analysis of the carbon apocalypse.
  • #7 - Scope of Big Data is discussed, along with an overview of data science. - Links to additional research material are provided. - Main takeaways are summarized, and book recommendations are offered.
  • #8 - Cisco Systems coined the term Zettabyte Era in 2016. - In 2016, the amount of digital data in the world exceeded a billion trillion bytes.
  • #9
    - CSV [row-oriented] is the least common denominator.
    - Schema (structure) information is lost with CSV.
    - Parquet [column-oriented], Avro [row-oriented] and ORC [column-oriented] are used in the Apache Hadoop ecosystem.
    - JSON is neither row- nor column-oriented.
  • #10
    - The greatest value of a picture is when it forces us to notice what we never expected to see.
    - Scatter plots display raw data and are useful for quickly identifying general trends and outliers.
    - Box (AKA box & whisker) plots show data range with respect to the median.
    - The rose chart was used by statistician and medical reformer Florence Nightingale to communicate the avoidable deaths of soldiers during the Crimean War.
    - See also https://datavizcatalogue.com/.
  • #11 - Example data science project: I Found the Weirdest Place in America Using Data Analysis.
  • #12 - Open-source adaptation of Google data analytical tools. - Apache Spark is now used in conjunction with Hadoop.
  • #13
    - In 1962, John Tukey published his seminal work on data analytics.
    - In 1989, Tim Berners-Lee invented his universal linked information system while at CERN.
    - In 1998, John Mashey coined the term Big Data while at SGI.
    - In 2003, Sanjay Ghemawat co-authored the Google File System white paper while at Google.
    - In 2004, Sanjay Ghemawat co-authored the MapReduce white paper while at Google.
    - In 2016, Cisco Systems coined the term Zettabyte Era.
  • #14 - Considered one of the fathers of data analytics and data science.
  • #15 - See also https://www.edwardtufte.com/online-course/.
  • #16 - The man who called it big.
  • #17 - Worked on the foundation technology found today in Apache Hadoop. - Google is one of the world’s largest technology companies. - Big Data built Google.
  • #18 - Spotting trends by tapping into the mother lode.
  • #19
    - Volume
    - Value (business)
    - Variety (data)
    - Velocity
    - Veracity (accuracy)
  • #20 - Get a good textbook on linear algebra.
  • #21 - In 2030, the number of data bytes in the World Wide Web will be three times the number of stars in the universe. - Get ready, all you emergence theorists! :)
  • #22 Gracias! Merci! Grazie! Danke! Diolch! Spasibo! Xie-Xie! Arigato! Gamsahaeyo! Dhanyavaad!
  • #23
    - 1196 Borregas (Gone Now! Not Coming Back :).
    - 2200 Mission College (Robert Noyce Building).
    - Frankie say no more.
    - Stanley Main Beach, a poem by Arthur Morgan:
      Talk.
      The fog lifts.
      Talk.
      Talk about the future.
      Talk about what can be.
      Dreams and desires.
      Love. Unrequited love.
      The fog rolls in.
    - Piz Buin, a poem by Arthur Morgan:
      Gazing up,
      The steely-eyed climber blocks the sun with her hand.
      You know,
      There’s something to be said about rarefied air.
      It sharpens the mind,
      And focuses it on an ancient singularity.