This document provides an introduction to big data, including:
- Big data is characterized by its volume, velocity, and variety, which makes it difficult to process using traditional databases and requires new technologies.
- Technologies like Hadoop, MongoDB, and cloud platforms from Google and Amazon can provide scalable storage and processing of big data.
- Examples of how big data is used include analyzing social media and search data to gain insights, enabling personalized experiences and targeted advertising.
- As data volumes continue growing exponentially from sources like sensors, simulations, and digital media, new tools and approaches are needed to effectively analyze and make sense of "big data".
Multiple regression on COVID mobility and Covid-19 policy recommendation, by Kan Yuenyong
Multiple regression analysis of Covid-19 policy is a contemporary agenda. This work demonstrates how to use Python for data wrangling and R for statistical analysis, in a form suitable for publication in a standard academic journal. The model examines whether lockdown policy was relevant to controlling the Covid-19 outbreak.
Monitoring world geopolitics through Big Data, by Tomasa Rodrigo and Álvaro Or... (Big Data Spain)
Data from the media allow us to enrich our analysis and incorporate these insights into our models to capture nonlinear behaviour and feedback effects of human interaction, assessing their global impact on society and enabling us to construct fragility indices and early warning systems.
https://www.bigdataspain.org/2017/talk/monitoring-world-geopolitics-through-big-data
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
A final project presentation based on the GDELT database.
Complete Report : https://samvat.github.io/ivmooc-gdelt-project/The GDELT Project - Final Report.pdf
NG2S: A Study of Pro-Environmental Tipping Point via ABMs, by Kan Yuenyong
A study of tipping points: much less is known about the most efficient ways to reach such transitions or how self-reinforcing systemic transformations might be instigated through policy. We employ an agent-based model to study the emergence of social tipping points through various feedback loops that have been previously identified to constitute an ecological approach to human behavior. Our model suggests that even a linear introduction of pro-environmental affordances (action opportunities) to a social system can have non-linear positive effects on the emergence of collective pro-environmental behavior patterns.
Massive Data Analysis: Challenges and Applications, by Vijay Raghavan
We highlight a few trends of massive data that are available for corporations, government agencies and researchers and some examples of opportunities that exist for turning this data into knowledge. We provide a brief overview of some of the state-of-the-art technologies in the massive data analysis landscape. Then, we describe two applications from two diverse areas in detail: recommendations in e-commerce, link discovery from biomedical literature. Finally, we present some challenges and open problems in the field of massive data analysis.
Adversarial Analytics - 2013 Strata & Hadoop World Talk, by Robert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
Using the Open Science Data Cloud for Data Science Research, by Robert Grossman
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
High Performance Data Analytics and a Java Grande Run Time, by Geoffrey Fox
There is perhaps a broad consensus as to the important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data-intensive applications and to deduce the needed runtimes and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is widely used in modern cloud computing.
We give some examples including clustering, deep learning and multi-dimensional scaling.
One suggestion from this work is the value of a high-performance Java (Grande) runtime that supports both simulations and big data.
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ..., by Geoffrey Fox
Keynote at the Sixth International Workshop on Cloud Data Management (CloudDB 2014), Chicago, March 31, 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and very importantly data source.
We then propose that in many cases it is wise to combine the well known commodity best practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC Apache integration is particularly important: File systems, Cluster resource management, File and object data management, Inter process and thread communication, Analytics libraries, Workflow and Monitoring.
See:
[1] Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, "A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures," accepted in IEEE BigData 2014. http://arxiv.org/abs/1403.1528
[2] G. Fox, J. Qiu and S. Jha, "High Performance High Functionality Big Data Software Stack," in Big Data and Extreme-scale Computing (BDEC), 2014, Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
Presentation at the AAAI 2013 Fall Symposium on Semantics for Big Data, Arlington, Virginia, November 15-17, 2013
Additional related material at: http://wiki.knoesis.org/index.php/Smart_Data
Related paper at: http://www.knoesis.org/library/resource.php?id=1903
Abstract: We discuss the nature of Big Data and address the role of semantics in analyzing and processing Big Data that arises in the context of Physical-Cyber-Social Systems. We organize our research around the five V's of Big Data, where four of the Vs are harnessed to produce the fifth V, value. To handle the challenge of Volume, we advocate semantic perception that can convert low-level observational data to higher-level abstractions more suitable for decision-making. To handle the challenge of Variety, we resort to the use of semantic models and annotations of data so that much of the intelligent processing can be done at a level independent of heterogeneity of data formats and media. To handle the challenge of Velocity, we seek to use continuous semantics capability to dynamically create event or situation specific models and recognize new concepts, entities and facts. To handle Veracity, we explore the formalization of trust models and approaches to glean trustworthiness. The above four Vs of Big Data are harnessed by the semantics-empowered analytics to derive Value for supporting practical applications transcending the physical-cyber-social continuum.
Data Science and Big Data Analytics are everywhere. They are buzzwords that everyone is talking about. Gartner even released a Hype Cycle for Data Science in July this year. And yet, many people are still confused as to what data science and big data analytics are and why they will become the new black!
This slide deck focuses on the core concepts and clarifies the misunderstandings behind those myths.
Socializing Big Data: Collaborative Opportunities in Computer Science, the So..., by Sheryl Grant
Harnessing the “data deluge” is promoting new conversations between disciplines. Prof. Marciano and his collaborators have been pursuing research in a number of areas including: big cultural data, access to big heterogeneous data, records in the cloud, federated grid/cloud storage, visual interfaces to large collections, policy-based frameworks to automate content management, and distributed cyberinfrastructure to enable data sharing. But more importantly, innovative technical approaches require the convergence of creative insights across computer science, the social sciences, and the humanities. This talk touches on these topics and highlights a new collaboration with partners at Duke.
Richard Marciano is a professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, Director of the Sustainable Archives and Leveraging Technologies (SALT) lab, and co-director of the Digital Innovation Lab (DIL). He leads development of "big data" projects funded by Mellon, NSF, NARA, NHPRC, IMLS, DHS, NIEHS, and UNC. Recent 2012 grants include a JISC Digging into Data award with UC Berkeley and the U. of Liverpool, called "Integrating Data Mining and Data Management Technologies for Scholarly Inquiry," a Mellon / UNC award called "Carolina Digital Humanities Initiative," which involves translating big data challenges into curricular opportunities, and an NSF award on big heterogeneous data integration.
He holds a B.S. in Avionics and Electrical Engineering, and an M.S. and Ph.D. in Computer Science, and has worked as a postdoc in Computational Geography. He conducted interdisciplinary research at the San Diego Supercomputer Center at UC San Diego, working with teams of scholars in the sciences, social sciences, and humanities.
Class lecture by Prof. Raj Jain on Networking Issues for Big Data. The talk covers Big Data Enabled by Networking, MapReduce, Hadoop, Networking Requirements for Big Data, Recent Developments in Networking, 1. Virtualizing Computation, 2. Virtualizing Storage, 3. Virtualizing Rack Storage Connectivity, Multi-Root IOV, 4. Virtualizing Data Center Storage, 5. Virtualizing Metro Storage, Virtualizing the Global Storage, Software Defined Networking, Network Function Virtualization (NFV), and Big Data for Networking. A video recording is available on YouTube.
Big Data Visualization
Kwan-Liu Ma
Professor of Computer Science and Chair of the Graduate Group in Computer Science (GGCS) at the University of California-Davis
January 22nd 2014
We are entering a data-rich era. Advanced computing, imaging, and sensing technologies enable scientists to study natural and physical phenomena at unprecedented precision, resulting in an explosive growth of data. The size of the collected information about the Web and mobile device users is expected to be even greater. To make sense and maximize utilization of such vast amounts of data for knowledge discovery and decision making, we need a new set of tools beyond conventional data mining and statistical analysis. One such tool is visualization. I will present visualizations designed for gleaning insight from massive data and guiding complex data analysis tasks. I will show case studies using data from cyber/homeland security, large-scale scientific simulations, medicine, and sociological studies.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
Mind-blown: "If you burned all data created in just one day onto DVDs, you could stack them on top of each other and reach the moon twice." (Bigdata 25ntkfacts)
This presentation, by big data guru Bernard Marr, outlines in simple terms what Big Data is and how it is used today. It covers the 5 V's of Big Data as well as a number of high value use cases.
If Big Data is data that exceeds the processing capacity of conventional systems, thereby necessitating alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address.
The DCC is currently working with 18 HEIs to support and develop their capabilities in the management of research data and, whilst the aforementioned challenge is not usually core to their expressed concerns, are there particular issues of curation inherent to Big Data that might force a different perspective?
We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but the inability to provide sufficient processing capacity is not one of their more frequent complaints.
That’s not to say that Big Science and its Big Data are free of challenges in data curation; only that they are shared with their lesser cousins, where one might say that the real challenge is less one of size than diversity and complexity.
This brief presentation explores those aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana (Medical Imaging Bank of the Valencian Community), by Maria de la Iglesia
According to Hal Varian (an expert in microeconomics and the economics of information and, since 2002, Chief Economist at Google): "In the coming years, the most attractive job will be that of the statistician. The ability to collect data, understand it, process it, extract its value, visualize it and communicate it will all be important skills in the coming decades. We now have free and ubiquitous data. What is still missing is the ability to understand that data."
Sirris innovate2011 - Smart Products with Smart Data - Introduction, by Dr. Elen... (Sirris)
This lecture highlights current trends, challenges and opportunities related to the emergence of large amounts of data. It also presents Sirris’s recent research activities in this domain.
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science, by Ari Berman
We'll explore current and future considerations in advanced computing architectures that empower the conversion of data into knowledge. The life sciences produce the largest amount of data of all major science domains, making analytics and scientific computing cornerstones of modern research programs and methodologies. We'll highlight the remarkable biomedical discoveries that are emerging through combined efforts, and discuss where and how the right infrastructure can catalyze the advancement of human knowledge. On-premises architectures as well as cloud, hybrid, and exotic architectures will all be discussed. It's likely that all life science researchers will require advanced computing to perform their research within the next year. However, there has been less focus on advanced computing infrastructure across the industry due to the increased availability of public cloud infrastructure and anything-as-a-service models.
4. We are living in the world of Data
Video Surveillance
Social Media
Mobile Sensors
Gene Sequencing
Smart Grids
Medical Imaging
Geophysical Exploration
5. Some data sizes
• ~40 × 10^9 web pages at ~300 kilobytes each = 10 petabytes
• YouTube: 48 hours of video uploaded per minute; in 2 months in 2010 it uploaded more than the total of NBC, ABC and CBS; ~2.5 petabytes per year uploaded?
• LHC: 15 petabytes per year
• Radiology: 69 petabytes per year
• Square Kilometer Array Telescope will be 100 terabits/second
• Earth observation becoming ~4 petabytes per year
• Earthquake science: a few terabytes total today
• PolarGrid: 100s of terabytes/year
• Exascale simulation data dumps: terabytes/second
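As a rough sanity check of the first figure above, a short Python sketch with the slide's own round numbers shows the order of magnitude (the exact value depends on whether decimal or binary prefixes are used):

```python
# Order-of-magnitude check for "~40 x 10^9 web pages at ~300 KB each".
pages = 40e9            # ~40 billion pages
bytes_per_page = 300e3  # ~300 kilobytes

total_bytes = pages * bytes_per_page
print(total_bytes / 1e15, "petabytes")   # ~12 PB, i.e. on the order of 10 PB
```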
8. Information as an Asset
• The cloud will enable larger and larger data sets to be easily collected and used
• People will deposit information into the cloud
– a bank or personal warehouse of data
• New technology will emerge
– larger and scalable storage technology
– innovative and complex data analysis/visualization for multimedia data
– security technology to ensure privacy
• The cloud will become mankind's intelligence and memory!
9. “Data is the new oil.”
Andreas Weigend, Stanford (ex-Amazon)
Data is more like soup: it’s messy and you don’t know what’s in it…
10. The Coming of the Data Deluge
• In the past, most scientific disciplines could be described as small data, or even data poor. Most experiments or studies had to contend with just a few hundred or a few thousand data points.
• Now, thanks to massively complex new instruments and simulators, many disciplines are generating correspondingly massive data sets that are described as big data, or data rich.
– Consider the Large Hadron Collider, which will eventually generate about 15 petabytes of data per year. A petabyte is about a million gigabytes, so that qualifies as a full-fledged data deluge.
"The Coming Data Deluge: As science becomes more data intensive, so does our language," by Paul McFedries, IEEE Spectrum, February 2011
12. Scale: an explosion of data
http://www.phgfoundation.org/reports/10364/
"A single sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project"
13. Creating a connectome
• Neuroscientists have set the goal of creating a connectome, a complete map of the brain's neural circuitry.
– An image of a cubic millimeter chunk of the brain would comprise about 1 petabyte of data (at a 5-nanometer resolution).
– There are about a million cubic millimeters of neural matter to map, making a total of about a thousand exabytes (an exabyte is about a thousand petabytes).
– This qualifies as what Jim Gray once called an exaflood of data.
"The Coming Data Deluge: As science becomes more data intensive, so does our language," by Paul McFedries, IEEE Spectrum, February 2011
14. The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.
—Jim Gray, computer scientist
18. • The White House today announced a $200 million big-data initiative to create tools to improve scientific research by making sense of the huge amounts of data now available.
• Grants and research programs are geared at improving the core technologies around managing and processing big data sets, speeding up scientific research with big data, and encouraging universities to train more data scientists and engineers.
• The emergent field of data science is changing the direction and speed of scientific research by letting people fine-tune their inquiries by tapping into giant data sets.
• Medical research, for example, is moving from broad-based treatments to highly targeted pharmaceutical testing for a segment of the population or people with specific genetic markers.
21. Big Data
“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
Reference: “What is big data? An introduction to the big data landscape,” Edd Dumbill, http://radar.oreilly.com/2012/01/what-is-big-data.html
22. Amazon View of Big Data
“Big data” refers to a collection of tools, techniques and technologies which make it easy to work with data at any scale. These distributed, scalable tools provide flexible programming models to navigate and explore data of any shape and size, from a variety of sources.
23. The Value of Big Data
• Analytical use
– Big data analytics can reveal insights previously hidden by data that was too costly to process.
• e.g. peer influence among customers, revealed by analyzing shoppers’ transactions and social and geographical data.
– Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data.
• Enabling new products
– Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business.
25. 3 Characteristics of Big Data
• Volume: volumes of data are larger than those conventional relational database infrastructures can cope with.
• Velocity: the rate at which data flows in is much faster, e.g. mobile events and interactions by users; video, image and audio from users.
• Variety: the source data is diverse and doesn’t fall into neat relational structures, e.g. text from social networks, image data, a raw feed directly from a sensor source.
26. Big Data Challenges
• Volume: how to process data so big that it cannot be moved or stored.
• Velocity: a lot of data arrives very fast, so it cannot all be stored (e.g. web usage logs, Internet and mobile messages); stream processing is needed to filter unused data or extract knowledge in real time (see the sketch below).
• Variety: so many types of unstructured data formats make conventional databases useless.
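To make the velocity point concrete, here is a minimal sketch of stream-style processing in plain Python: events are filtered and summarized as they arrive instead of being stored first. The event format and the relevance rule are illustrative assumptions, not something from the original slides.

```python
from collections import Counter

def event_stream():
    """Stand-in for a live feed (web logs, mobile messages, sensor readings)."""
    yield {"type": "page_view", "page": "/home"}
    yield {"type": "heartbeat"}                 # noise we do not want to keep
    yield {"type": "page_view", "page": "/buy"}

def is_relevant(event):
    # Filter out unused data instead of storing everything.
    return event["type"] == "page_view"

counts = Counter()
for event in event_stream():          # process each event as it arrives
    if is_relevant(event):
        counts[event["page"]] += 1    # keep only a small running summary

print(counts)   # e.g. Counter({'/home': 1, '/buy': 1})
```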
28. From “Data Driven Discovery in Science: The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
29. What is needed for big data
• Your data
• Storage infrastructure
• Computing infrastructure
• Middleware to handle BIG Data
• Data Analysis
– Statistical analysis
– Data Mining
• People
30. How to deal with big data
• Integration of
– Storage
– Processing
– Analysis algorithms
– Visualization
[Diagram: massive data feeds stream processing and storage; storage feeds processing and analysis, and the results are visualized]
31. How can we store and process massive data?
• Beyond the capability of a single server
• Basic infrastructure
– a cluster of servers
– a high-speed interconnect
– a high-speed storage cluster
• Incoming data will be spread across the server farm
• Processing is quickly distributed to the farm
• Results are collected and sent back (see the sketch below)
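As a toy, single-machine illustration of the scatter, process and gather pattern sketched above (worker processes standing in for servers in a farm), assuming nothing beyond the Python standard library:

```python
from multiprocessing import Pool

def count_words(chunk):
    # Work done independently on each "server" (here: a worker process).
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data needs new tools"] * 1_000          # stand-in for incoming data
    chunks = [lines[i::4] for i in range(4)]              # spread data across 4 workers

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)    # distribute the processing

    print(sum(partial_counts))                            # collect and combine results
```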
32. NoSQL (Not Only SQL)
• Next-generation databases mostly addressing some of these points:
– being non-relational, distributed, open-source and horizontally scalable
– used to handle a huge amount of data
– the original intention has been modern web-scale databases
Reference: http://nosql-database.org/
33. MongoDB
• MongoDB is a general purpose, open-source database.
• MongoDB features:
– Document data model with dynamic schemas
– Full, flexible index support and rich queries
– Auto-sharding for horizontal scalability
– Built-in replication for high availability
– Text search
– Advanced security
– Aggregation framework and MapReduce
– Large media storage with GridFS
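A minimal sketch of the document model and dynamic schemas using the PyMongo driver; the connection URI, the database and collection names, and the sample documents are assumptions made for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB instance
events = client["demo_db"]["events"]                # database/collection created lazily

# Dynamic schema: documents in the same collection need not share fields.
events.insert_one({"user": "alice", "action": "login"})
events.insert_one({"user": "bob", "action": "purchase", "amount": 42.0})

events.create_index("user")                          # flexible secondary indexes

# Rich queries over documents.
for doc in events.find({"action": "purchase", "amount": {"$gt": 10}}):
    print(doc["user"], doc["amount"])
```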
34. What is Hadoop?
- Hadoop, or Apache Hadoop
- an open-source software framework
- supports data-intensive distributed applications
- developed by the Apache Software Foundation
- derived from Google's MapReduce and Google File System (GFS) papers
- implemented in Java
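Hadoop's MapReduce model is easiest to see with the classic word count. The sketch below follows the Hadoop Streaming convention, where the mapper and reducer are ordinary programs (here Python) that read stdin and write tab-separated key/value pairs to stdout; the file names and the `hadoop jar` command mentioned afterwards are illustrative assumptions.

```python
#!/usr/bin/env python3
import sys

def mapper():
    # Map phase: emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts for a word are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Run as "python wordcount.py map" or "python wordcount.py reduce".
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

In a real job the two functions would live in separate scripts passed to the streaming jar (e.g. `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...`); Hadoop handles distributing the work and sorting the mapper output by key before it reaches the reducer.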
37. Google Cloud Platform
• App Engine
– mobile and web apps
• Cloud SQL
– MySQL on the cloud
• Cloud Storage
– data storage
• BigQuery
– data analysis
• Google Compute Engine
– processing of large data
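A minimal sketch of the BigQuery item using the google-cloud-bigquery Python client; the public dataset queried, and the assumption that credentials and a default project are already configured, are illustrative rather than part of the slides:

```python
from google.cloud import bigquery

# Assumes credentials and a default project are already configured.
client = bigquery.Client()

# Example: aggregate a large public table with plain SQL.
query = """
    SELECT year, COUNT(*) AS births
    FROM `bigquery-public-data.samples.natality`
    GROUP BY year
    ORDER BY year
"""
for row in client.query(query).result():   # BigQuery scans the data server-side
    print(row.year, row.births)
```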
38. Amazon
• Amazon EC2
– computation service using VMs
• Amazon DynamoDB
– large, scalable NoSQL database
– fully distributed, shared-nothing architecture
• Amazon Elastic MapReduce (Amazon EMR)
– Hadoop-based analysis engine
– can be used to analyse data from DynamoDB
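A small, hypothetical sketch of the DynamoDB item above using the boto3 SDK; the region, the table name (with partition key "user_id") and the items are assumptions, and the table is assumed to already exist:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # assumed region
table = dynamodb.Table("user_events")   # assumed existing table, partition key "user_id"

# Write an item; DynamoDB is schemaless beyond the key attributes.
table.put_item(Item={"user_id": "alice", "event": "login", "ts": 1700000000})

# Read it back by key.
response = table.get_item(Key={"user_id": "alice"})
print(response.get("Item"))
```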
39. Issues
• The I/O capability of a single computer is limited; how do we handle massive data?
• Big data cannot be moved
– careful planning must be done to handle big data
– processing capability must be there from the start
41. WHAT FACEBOOK KNOWS
Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.
http://www.facebook.com/data
42. Study of Human Society
• Facebook, in collaboration with the University of Milan, conducted an experiment that involved
– the entire social network as of May 2011
– more than 10 percent of the world's population.
• Analyzing the 69 billion friend connections among those 721 million people showed that
– four intermediary friends are usually enough to introduce anyone to a random stranger.
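The "four intermediary friends" finding is a statement about shortest paths in the friendship graph. Below is a toy Python sketch of how such a separation is measured with breadth-first search; the tiny graph is invented for illustration, whereas the real computation runs over billions of edges on distributed infrastructure:

```python
from collections import deque

# Toy friendship graph (adjacency lists); the real one has 721M users and 69B edges.
friends = {
    "ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob", "dan"],
    "dan": ["cat", "eve"], "eve": ["dan"],
}

def degrees_of_separation(graph, start, goal):
    """Length of the shortest friendship chain between two users (BFS)."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        user, dist = queue.popleft()
        if user == goal:
            return dist
        for friend in graph[user]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None

print(degrees_of_separation(friends, "ann", "eve"))  # 4 hops, i.e. 3 intermediaries
```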
43. The links of Love
• Often young women specify that they are “in a relationship” with their “best friend forever”.
– Roughly 20% of all relationships for the 15-and-under crowd are between girls.
– This number dips to 15% for 18-year-olds and is just 7% for 25-year-olds.
• For anonymous US users who were over 18 at the start of the relationship,
– the average of the shortest number of steps to get from any one U.S. user to any other individual is 16.7.
– This is much higher than the 4.74 steps you’d need to go from any Facebook user to another through friendship, as opposed to romantic, ties.
[Graph showing the relationships of anonymous US users who were over 18 at the start of the relationship]
http://www.facebook.com/notes/facebook-data-team/the-links-of-love/10150572088343859
44. Why?
• Facebook can improve the user experience
– make useful predictions about users' behavior
– make better guesses about which ads you might be more or less open to at any given time
• Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship.
45. How does Facebook handle Big Data?
• Facebook built its data storage system using open-source software called Hadoop.
– Hadoop spreads the data across many machines inside a data center.
– Facebook uses Hive, open-source software that acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code (see the sketch below).
• Much of Facebook's data resides in one Hadoop store more than 100 petabytes (a million gigabytes) in size, says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially: “Over the last few years we have more than doubled in size every year.”
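To illustrate the Hive point above: Hive lets you write SQL-like queries that are compiled into jobs over the Hadoop store. A hypothetical sketch using the PyHive client; the host, database, table and column names are assumptions for illustration:

```python
from pyhive import hive

# Connect to an assumed HiveServer2 endpoint.
conn = hive.Connection(host="hive.example.internal", port=10000, database="default")
cursor = conn.cursor()

# Relatively simple SQL-like code; Hive translates it into distributed jobs over Hadoop.
cursor.execute("""
    SELECT country, COUNT(*) AS logins
    FROM login_events
    WHERE dt = '2013-01-01'
    GROUP BY country
""")
for country, logins in cursor.fetchall():
    print(country, logins)
```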
46. Google Flu
• A pattern emerges when all the flu-related search queries are added together.
• We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.
• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
http://www.google.org/flutrends/about/how.html
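The mechanism described above amounts to checking that aggregated query counts track a surveillance signal and then using them as an estimator. A minimal sketch with invented weekly numbers (NumPy for the correlation and a one-variable linear fit; every value is made up for illustration):

```python
import numpy as np

# Invented weekly values: flu-related query share and surveillance case counts.
query_share = np.array([0.8, 1.1, 1.9, 3.2, 4.0, 2.7, 1.5])
reported_cases = np.array([120, 160, 300, 520, 640, 430, 240])

# 1. Do the queries track the surveillance signal?
r = np.corrcoef(query_share, reported_cases)[0, 1]
print(f"correlation: {r:.2f}")

# 2. If so, fit a simple linear model and estimate current flu activity
#    from this week's query share alone.
slope, intercept = np.polyfit(query_share, reported_cases, 1)
this_week_share = 3.5
print(f"estimated cases: {slope * this_week_share + intercept:.0f}")
```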
47. From “Data Driven Discovery in Science: The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
48. From “Data Driven Discovery in Science: The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
56. Preparing for Big Data
• Understanding and preparing your data
– To effectively analyse and, more importantly, cross-analyse your data sets (this is often where the most insightful results come from), you need a rigorous knowledge of what data you have.
• Getting staff up to scratch
– Finding people with data analysis experience
• Defining the business objectives
– Once the end goal has been decided, a strategy can be created for implementing big data analytics to support the delivery of this goal.
• Sourcing the right suppliers and technology
– In terms of storage, hardware and data warehousing, you will need to make a range of decisions to make sure you have all the capabilities and functionality required to meet your big data needs.
http://www.thebigdatainsightgroup.com/site/article/preparing-big-data-revolution
58. Trends
• A move toward large and scalable virtual infrastructure
– providing computing services
– providing basic storage services
– providing scalable large databases
• NoSQL
– providing analysis services
• All these services have to come together
– big data cannot be moved!
59. Issues
• Security
– Will you let important data accumulate outside your organization?
• If it is not important data, why analyze it?
– Who owns the data? If you discontinue the service, is the data destroyed properly?
– Protection in a multi-tenant environment
• Big data cannot be moved easily
– Processing has to be near the data; you just cannot ship data around.
• So you finally have to select the same cloud for your processing. Is it available, easy, fast?
• New learning and development costs
– Need new programming or porting?
– Are the tools mature enough?
60. When to use Big Data on the Cloud
• When data is already on the cloud
– virtual organizations
– cloud-based SaaS services
• For startups
– CAPEX becomes OPEX
– no need to maintain large infrastructure
– focus on scalability and pay as you go
– the data is on the cloud anyway
• For experimental projects
– pilots for new services
61. Summary
• Big data is coming.
– It is changing the way we do science.
– Big data is being accumulated anyway.
– Knowledge is power: better understand your customers so you can offer better service.
• Tools and technology are available
– and are still being developed fast.