FUNDAMENTALS
OF
BIG DATA
The Story of Big Data
Introduction
In 2005, Mark Kryder observed that magnetic disk storage capacity was increasing very quickly: “Inside of a decade and a half, hard disks had increased their capacity 1,000-fold.” Intel founder Gordon Moore called this rate of increase “flabbergasting.”
Being Data-Driven
● Companies have taken advantage of this ability to store and quickly access massive amounts of data—so much so that every day across the globe we create 2.5 quintillion bytes of data (2.5 exabytes).*
● Data-driven companies have demanded new data-processing and analysis techniques
that can scale to handle very large computing workloads.
● This ensuing explosion in the amount and variety of data available, and the challenges
of processing and analyzing it, led to the concept of big data.
*IBM
A Few Big Data Success Stories
● PredPol Inc., the Los Angeles and Santa Cruz police departments, and a team of
educators created software to analyze crime data, and predict where crimes are
likely to occur down to 500 square feet. In areas in LA where the software is being
used, there's been a 33% reduction in burglaries and a 21% reduction in violent
crimes.
● The Tesco supermarket chain collected 70 million data points (such as energy
consumption) from its refrigerators and learned to predict when the refrigerators
need servicing to cut down on energy costs.
Source: searchcio.techtarget.com
3 Key Things Driving the Growth of Big Data
1. People
● Using mobile phones, the Internet, and a variety of other things, billions of people are creating and
consuming information faster than ever before in history.
2. Organizations
● Some companies have established dominant positions as leaders in their markets by successfully
mastering a variety of complex data types and tools to run operations and derive business
intelligence insights.
● Most companies are not equipped to handle the vast amount of data available.
3. Sensors and beacons
● A sensor detects changes in its environment and converts this to information. A common example
is a motion detector.
● A beacon gives off a signal that’s detected by a sensor. One example is a Bluetooth® beacon.
● These devices have become smaller, cheaper, and more prevalent, and they generate mountains of
data.
Big Data Is a Broad Term
Terabytes (TBs) or petabytes (PBs) of data are usually considered big data, but a 100-gigabyte (GB) relational database could also be a big data problem.
If you have GBs of data per second coming in that you need to process and store, you
have a big data problem.
Even if you have only a moderate amount of data, you might have a big data problem if you have to process and analyze it repeatedly.
What Makes Data Big?
Big data describes situations that arise when your datasets become so large that traditional tools, such as relational databases, can no longer process them adequately.
This could be because of:
● Volume: Your dataset is so large that it no longer fits on a single computer or
relational database.
● Velocity: Data comes in rapidly or changes so often that you can’t process it fast
enough for it to be useful.
● Variety: Data comes from a variety of sources and in different formats, which
require different types of processing.
[Image from TechTarget: “What is big data?”]
Other Factors Impacting Big Data Management
● Value: Extracting insight from large datasets.
● Valence: The ease with which data can be combined with other data and made more valuable.
● Veracity: Maintaining data integrity and accuracy.
● Viscosity: The ease with which data can be moved from one storage system to another.
Impact of Big Data
Big data issues impact all phases of data handling, including:
● Monitoring
● Collection
● Storage
● Processing
● Analysis
● Reporting
This greatly complicates the information technology (IT) job, demanding more expertise
from IT professionals.
Trends in Big Data
Analysts agree that the amount of data generated every year will continue to grow
massively for the foreseeable future. This will create new opportunities to
capitalize on business insights gathered from data.
It’s likely that the variety of sources of data will continue to grow in number.
Adoption of cloud computing will continue to increase as cloud tools become cheaper and easier to use. In contrast, on-premises systems are not likely to become significantly easier to set up and use.
Big Data Market
International Data Corporation forecasts that the big data technology and services market
will grow about 23% per year, with annual spending reaching $48.6 billion in 2019.
There are many companies offering services in different areas in the big data industry.
Review this overview of big data vendors and technologies provided by Capgemini.
Big Data Complexity Creates IT Opportunities
[Diagram: the big data pipeline, from data sources through Ingest, Process, Store, Analyze, and Visualize stages, ending in actionable insights.
● Data sources — traditional: transaction data (OLTP), application data (ERP, CRM), third-party data; new: machine data, docs and emails, social data, sensor data, weblogs and clickstream data, images and videos.
● Ingest: data ingest apps, data replication, data integration (ETL: transformation and load), data quality, data prep.
● Process and store: stream computing, NoSQL DBs, flat files, EDW (staging, exploration, archiving), data marts, operational DBs, ERP and CRM DBs.
● Analyze: real-time processing and analytics, analytical (OLAP) systems, advanced analytics.
● Visualize: reporting, dashboards, discovery & exploration, modeling & predictive analytics.
Data management, governance, and security span all stages.]
Big Data Has Been Inaccessible to Most Businesses
● Big data is difficult: It requires experts to manage a complex, distributed
computing infrastructure. These specialists are expensive and difficult to hire, and
the work takes a lot of time.
● Big data is expensive: Costs tend to grow with the volume, velocity, and variety of
data. And computing resources must be provisioned for peak demand. That means
you might have to purchase more computing resources than you need most of the
time.
Google Cloud Platform | Confidential & Proprietary
Complexities of Big Data Processing
Typical big data processing tasks include:
● Programming
● Resource provisioning
● Performance tuning
● Monitoring
● Reliability
● Deployment & configuration
● Handling growing scale
● Utilization improvements
Big Data Processing on a Cloud Platform
On a cloud platform, programming is the only one of these tasks left to you. Focus on insight, not infrastructure.
Recast Big Data Problems as Data Science Opportunities
Your role in sales is to:
● Help identify and scope the big data problem
● Help your customer see it as solvable with data science
● Help your customer see an opportunity to get an advantage over the competition
Big Data Use Cases
Data Reference Architecture (Data Engineering and Data Science)
● Cloud Pub/Sub: asynchronous messaging
● Cloud Storage: raw log storage
● Cloud Dataflow: parallel data processing (batch pipeline)
● BigQuery: analytics engine
● Cloud Machine Learning: train models
Cloud Storage: Performant, Unified, and Cost-Effective Object Storage
● Single API across all storage classes
● Object lifecycle management across classes
● Millisecond time to first byte for every class
Cloud Pub/Sub: Scalable Event Ingestion and Delivery
● Open APIs
● Global, fully managed event delivery
● Integrated with Cloud Dataflow for stream processing
BigQuery: Fully Managed SQL Data Warehouse
● OLAP analytics engine
● Scales from GB to PB with zero operations
Bigtable: Fully Managed NoSQL Database
● Fully managed NoSQL, wide-column database for TB to PB datasets
● Supports the open source HBase API and integrates with GCP data solutions
● Single indexed schema for thousands of columns, millions of rows
● Low latency and high throughput: millions of operations per second
[Diagram: user engagement analytics on GCP. App events and user interactions flow into Cloud Pub/Sub topics (with ACLs), through Cloud Dataflow streaming and batch pipelines (with open source orchestration and connectors), into the BigQuery engine for user engagement analytics and abuse detection, and into storage services (Cloud Storage, Cloud Datastore, Cloud SQL). Results feed business dashboards and data science tools used by business users, devs, and data scientists.]
Big Data Use Cases
To recap, the concept of big data covers a lot of ground and generally refers to the collection, storage, processing, analysis, and visualization of very large and very fast-moving datasets. Big data use cases span every industry as businesses increasingly look to differentiate their offerings by extracting insight from the data in their business.
The following slides describe some popular big data use cases.
Use Case: Extract, Transform, and Load (ETL)
Whenever you’re managing a massive amount of data, you’re
going to need to:
1. Extract a lot of raw data from disparate sources.
2. Transform that data into a form that can be used for your
business operations or analysis, perhaps by aggregating
or cleansing it.
3. Load that data into your data warehouse so you can use
it.
ETL generally refers to this process of moving and preparing data. An alternative is ELT, where you load the unprepared data into the data warehouse first and then transform it there.
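The three steps can be sketched in a few lines. This is a toy illustration, not any particular ETL product: the source records, field names, and warehouse table are all invented for the example.

```python
import sqlite3

# Hypothetical ETL sketch: extract raw order records, transform them
# (cleanse fields and aggregate per customer), and load the result into
# a warehouse table. All names and data are invented for illustration.

def extract():
    """Extract: pull raw rows from disparate sources (hard-coded here)."""
    return [
        {"customer": " alice ", "amount": "10.50"},
        {"customer": "BOB", "amount": "4.25"},
        {"customer": "alice", "amount": "7.00"},
    ]

def transform(rows):
    """Transform: cleanse the customer field and aggregate spend."""
    totals = {}
    for row in rows:
        name = row["customer"].strip().lower()   # cleanse
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals

def load(totals, conn):
    """Load: write the prepared data into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend (customer TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO customer_spend VALUES (?, ?)", totals.items())
    conn.commit()

conn = sqlite3.connect(":memory:")      # stand-in for a real warehouse
load(transform(extract()), conn)
print(dict(conn.execute("SELECT * FROM customer_spend")))
```

The ELT variant would simply run `load` on the raw rows first and do the cleansing and aggregation inside the warehouse, typically in SQL.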
Use Case: 360-degree Customer View
A 360-degree customer view is the attempt to get a
complete view of customers by combining data from
various touch points, such as marketing and the
purchasing process. Businesses use a 360-degree
customer view to drive better engagement, more revenue,
and long-term loyalty. It’s used by:
● Financial service businesses to determine the best
financial packages—insurance, investments, and so
on—to sell to specific customers.
● Retail businesses to determine the best times to
make special offers to maximize sales.
● Enterprise businesses to determine customer
retention and upsell strategies.
Use Case: Fraud Detection
Fraud detection is the process of identifying anomalies in patterns of behavior that signal
potential fraud. Today, fraud detection can involve analyzing large volumes of data, such as:
● Transactions
● Authorization information
● Buying patterns
For example, it’s used by:
● Credit card companies to prevent unauthorized purchases that don’t match a
customer’s profile.
● Financial service businesses to prevent illegal financial transactions.
● Technology businesses to prevent unauthorized access to products and services, such
as email.
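One simple version of "purchases that don't match a customer's profile" is an anomaly score against the customer's spending history. The sketch below is purely illustrative (real fraud systems combine many signals); the history values and threshold are invented.

```python
import statistics

# Hypothetical sketch: flag transactions whose amount deviates sharply
# from a customer's historical spending pattern, using a z-score.
# Real fraud detection combines many more signals than this.

def flag_anomalies(history, new_amounts, threshold=3.0):
    """Return amounts whose z-score against `history` exceeds `threshold`."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [a for a in new_amounts if abs(a - mean) / stdev > threshold]

history = [20.0, 25.0, 22.0, 30.0, 18.0, 24.0]   # typical purchases
suspicious = flag_anomalies(history, [23.0, 950.0])
print(suspicious)   # the $950 purchase stands out; the $23 one does not
```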
Use Case: Saving Lives
● Sequencing a human genome—all 3 billion “letters” that denote an individual's
unique DNA sequence—is providing information that’s improving scientists'
understanding of the genetic basis of many human diseases.
● Other large-scale projects, such as the 100,000 Genomes Project, are starting to
give some families a diagnosis for a child’s mysterious condition. Participants give
consent for their genome data to be linked to information about their medical
condition and health records. The medical and genomic data is shared with
researchers to improve knowledge of the causes, treatment, and care of diseases.
Other Use Cases
Source: A.T. Kearney Analysis
Common Big Data Terms

Common Big Data Terms - I
● Node: Usually a device on a network. A node on the Internet is anything that has an IP address.
● Distributed processing: The method of spreading data-processing capabilities across a set of networked computers.
● Batch processing: Processing of sets of data instead of single units to maximize efficiency.
● Stream processing: Continuous and automatic processing of data as it’s captured, in order to generate systematic output.
● Massively parallel processing (MPP): The use of a large number of distributed computers to perform a set of coordinated computations in parallel (simultaneously).
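To make the batch/stream distinction concrete, here is a minimal sketch (illustrative only, with invented sensor readings): the same aggregation computed once over a complete dataset versus incrementally as records arrive.

```python
# Illustrative contrast between batch and stream processing: both compute
# a running total of sensor readings, but batch waits for the full dataset
# while stream emits an updated result as each record arrives.

def batch_total(readings):
    """Batch: process the complete dataset in one pass."""
    return sum(readings)

def stream_totals(readings):
    """Stream: yield an updated total after every incoming record."""
    total = 0.0
    for r in readings:          # in a real system, an unbounded source
        total += r
        yield total             # output is available continuously

readings = [1.5, 2.0, 0.5]
print(batch_total(readings))          # one result at the end: 4.0
print(list(stream_totals(readings)))  # a result per record: [1.5, 3.5, 4.0]
```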
Common Big Data Terms - II
Data collection: The process of gathering data for analysis and evaluation from a variety of sources, which can be structured or unstructured in format. Also called data capture.
Data aggregation: The process of compiling information from multiple databases to create a combined dataset, usually for data processing, reporting, or analysis.
Data pipeline: Executable code defining a set of data-processing steps for transforming data.
Machine data: Data that records the activity and behavior of customers, users, transactions, applications, servers, networks, and mobile devices. It includes configurations, data from APIs, message queues, change events, the output of diagnostic commands, call detail records, sensor data from industrial systems, and more.
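The "executable code defining a set of data-processing steps" in the data pipeline definition can be sketched directly. This is a toy illustration; the step names and weather-station records are invented.

```python
# Hypothetical data pipeline sketch: a sequence of processing steps
# applied in order to each record. Steps and data are invented.

def parse(record):
    """Step 1: clean whitespace and split a CSV line into fields."""
    return record.strip().split(",")

def to_celsius(fields):
    """Step 2: convert the Fahrenheit reading to Celsius."""
    station, fahrenheit = fields
    return station, round((float(fahrenheit) - 32) * 5 / 9, 1)

def run_pipeline(records, steps):
    """Run every record through each step of the pipeline in order."""
    for record in records:
        for step in steps:
            record = step(record)
        yield record

raw = [" KSFO,68 ", "KJFK,41\n"]
print(list(run_pipeline(raw, [parse, to_celsius])))
```

Production pipeline frameworks add scheduling, parallelism, and fault tolerance around this same basic step-composition idea.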
Common Big Data Terms - III
Data science: The field of study of where information comes from, what it represents, and how it can be turned into valuable insights.
Data lake: A storage repository that holds a vast amount of raw data in its native format, including structured, semistructured, and unstructured data. Data is extracted from a data lake as needed and transformed into the format required by downstream processing.
Data monitoring: A business practice in which critical business data is routinely checked against quality control rules to make sure it is always high quality and meets previously established standards for formatting and consistency.
Data warehouse: A central repository of integrated data from one or more disparate sources, used for reporting and data analysis.
Vanilla: Refers to an installation that is straight from the source, contains no customization, and isn’t distributed by a third party.
Common Data Analytics Terms
Common Data Analytics Terms - I
Data mart: The part of a data warehouse that’s used to get data out to users, usually oriented to a specific business line or team.
Statistical computing: The interface between statistics and computer science. It’s the
area of computational science (or scientific computing) specific to the mathematical
science of statistics.
Web, mobile, and commerce analytics: The measurement, collection, analysis, and
reporting of web, mobile, or commerce data for purposes of understanding and
optimizing usage.
Online analytical processing (OLAP): An approach to answering analytical queries
swiftly as part of the broader category of business intelligence. Typical applications of
OLAP include business reporting for sales, marketing, management reporting, business
process management, budgeting and forecasting, financial reporting, and similar areas.
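As a toy illustration of the kind of query an OLAP system answers quickly, here is a fact table of sales rolled up along two dimensions. The regions, quarters, and figures are invented for the example.

```python
from collections import defaultdict

# Illustrative sketch of an OLAP-style roll-up: aggregate a fact table
# of sales along a chosen dimension. All data here is invented.

# Fact table rows: (region, quarter, revenue)
sales = [
    ("EMEA", "Q1", 100), ("EMEA", "Q2", 120),
    ("APAC", "Q1", 80),  ("APAC", "Q2", 95),
]

def roll_up(facts, dim):
    """Sum revenue grouped by one dimension (0 = region, 1 = quarter)."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dim]] += row[2]
    return dict(totals)

print(roll_up(sales, 0))   # by region: {'EMEA': 220, 'APAC': 175}
print(roll_up(sales, 1))   # by quarter: {'Q1': 180, 'Q2': 215}
```

Real OLAP engines precompute and index such aggregates across many dimensions so that analysts can slice and dice interactively.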
Common Data Analytics Terms - II
Real time: Near-zero latency, with access to data whenever it is required. This means business insights can be acted on as events happen rather than after the fact; analytics jobs used to take hours or days, often rendering critical business information stale by the time it arrived.
Speech and vision recognition, and natural language processing: Three core areas of machine learning that rely on huge amounts of training data and must process large amounts of data in real time.
Test Yourself: Can You Define Everything Shown?
[The full big data pipeline diagram from “Big Data Complexity Creates IT Opportunities” is repeated here: data sources (traditional and new), ingest, process, store, analyze, and visualize stages, with data management, governance, and security spanning all of them.]
Additional Resources
● Big data assets on the partner portal
● Google Cloud Platform big data one pager
● Big Data and the Creative Destruction of Today’s Business Models
● Public data sets for use by anyone for analyzing problems
● Video: What is Big Data? Can it help us solve some of society’s big challenges?
● Video: Deep Learning: Intelligence from Big Data
● Online Harvard course on Data Science
● Interesting big data infographic
● How big data is changing the database landscape
