FUNDAMENTALS
OF
BIG DATA
The Story of Big Data
Introduction
In 2005, Mark Kryder observed that magnetic disk storage capacity was increasing very quickly: “Inside of a decade and a half, hard disks had increased their capacity 1,000-fold.” Intel founder Gordon Moore called this rate of increase “flabbergasting.”
Being Data-Driven
● Companies have taken advantage of this ability to store and quickly access massive amounts of data—so much so that every day across the globe we create 2.5 quintillion bytes of data (2.5 exabytes).*
● Data-driven companies have demanded new data-processing and analysis techniques
that can scale to handle very large computing workloads.
● This ensuing explosion in the amount and variety of data available, and the challenges
of processing and analyzing it, led to the concept of big data.
*IBM
A Few Big Data Success Stories
● PredPol Inc., the Los Angeles and Santa Cruz police departments, and a team of
educators created software to analyze crime data, and predict where crimes are
likely to occur down to 500 square feet. In areas in LA where the software is being
used, there's been a 33% reduction in burglaries and a 21% reduction in violent
crimes.
● The Tesco supermarket chain collected 70 million data points (such as energy
consumption) from its refrigerators and learned to predict when the refrigerators
need servicing to cut down on energy costs.
Source: searchcio.techtarget.com
3 Key Things Driving the Growth of Big Data
1. People
● Using mobile phones, the Internet, and a variety of other things, billions of people are creating and
consuming information faster than ever before in history.
2. Organizations
● Some companies have established dominant positions as leaders in their markets by successfully
mastering a variety of complex data types and tools to run operations and derive business
intelligence insights.
● Most companies are not equipped to handle the vast amount of data available.
3. Sensors and beacons
● A sensor detects changes in its environment and converts this to information. A common example
is a motion detector.
● A beacon gives off a signal that’s detected by a sensor. One example is a Bluetooth® beacon.
● These devices have become smaller, cheaper, and more prevalent, and they generate mountains of
data.
Big Data Is a Broad Term
Terabytes (TBs) or petabytes (PBs) of data are usually considered big data, but a 100-gigabyte (GB) relational database could also be a big data problem.
If you have GBs of data per second coming in that you need to process and store, you
have a big data problem.
Even if you have only a moderate amount of data, you might have a big data problem if you have to process and analyze it repeatedly.
What Makes Data Big?
Big data describes situations that arise when your datasets become so large that traditional tools, such as relational databases, can no longer process them adequately.
This could be because of:
● Volume: Your dataset is so large that it no longer fits on a single computer or
relational database.
● Velocity: Data comes in rapidly or changes so often that you can’t process it fast
enough for it to be useful.
● Variety: Data comes from a variety of sources and in different formats, which
require different types of processing.
[Image from TechTarget: “What is big data?”]
Other Factors Impacting Big Data Management
● Value: Extracting insight from large datasets.
● Valence: The ease with which data can be combined with other data and made more valuable.
● Veracity: Maintaining data integrity and accuracy.
● Viscosity: The ease with which data can be moved from one storage system to another.
Impact of Big Data
Big data issues impact all phases of data handling, including:
● Monitoring
● Collection
● Storage
● Processing
● Analysis
● Reporting
This greatly complicates the information technology (IT) job, demanding more expertise
from IT professionals.
Trends in Big Data
Analysts agree that the amount of data generated every year will continue to grow
massively for the foreseeable future. This will create new opportunities to
capitalize on business insights gathered from data.
It’s likely that the variety of sources of data will continue to grow in number.
Adoption of cloud computing will continue to increase as cloud tools become cheaper and easier to use. In contrast, on-premises systems are not likely to become significantly easier to set up and use.
Big Data Market
International Data Corporation forecasts that the big data technology and services market
will grow about 23% per year, with annual spending reaching $48.6 billion in 2019.
There are many companies offering services in different areas in the big data industry.
Review this overview of big data vendors and technologies provided by Capgemini.
Big Data Complexity Creates IT Opportunities
[Diagram: the big data pipeline, from data sources through Ingest, Process, Store, Analyze, and Visualize stages, ending in actionable insights.
● Data sources — traditional: transaction data (OLTP), application data (ERP, CRM), third-party data; new: machine data, docs and emails, social data, sensor data, weblogs and clickstream data, images and videos.
● Ingest: data ingest apps, data replication, data integration (ETL: transformation and load), data quality, data prep.
● Process and store: stream computing, NoSQL DBs, flat files, EDW (staging, exploration, archiving), data marts, operational DBs, ERP and CRM DBs.
● Analyze: real-time processing and analytics, analytical (OLAP) systems, advanced analytics.
● Visualize: reporting, dashboards, discovery & exploration, modeling & predictive analytics.
Data management, governance, and security span all stages.]
Big Data Has Been Inaccessible to Most Businesses
● Big data is difficult: It requires experts to manage a complex, distributed
computing infrastructure. These specialists are expensive and difficult to hire, and
the work takes a lot of time.
● Big data is expensive: Costs tend to grow with the volume, velocity, and variety of
data. And computing resources must be provisioned for peak demand. That means
you might have to purchase more computing resources than you need most of the
time.
Google Cloud Platform | Confidential & Proprietary
Complexities of Big Data Processing
Typical big data processing tasks include:
● Programming
● Resource provisioning
● Performance tuning
● Monitoring
● Reliability
● Deployment & configuration
● Handling growing scale
● Utilization improvements
Big Data Processing on a Cloud Platform
On a cloud platform, programming is the only one of these tasks left to you. Focus on insight, not infrastructure.
Recast Big Data Problems as Data Science Opportunities
Your role in sales is to:
● Help identify and scope the big data problem
● Help your customer see it as solvable with data science
● Help your customer see an opportunity to get an advantage over the competition
Big Data Use Cases
Data Reference Architecture (Data Engineering and Data Science)
● Cloud Pub/Sub: asynchronous messaging
● Cloud Storage: raw log storage
● Cloud Dataflow: parallel data processing (batch pipeline)
● BigQuery: analytics engine
● Cloud Machine Learning: train models
Cloud Storage: Performant, Unified, and Cost-Effective Object Storage
● Single API across all storage classes
● Object lifecycle management across classes
● Millisecond time to first byte for every class
Cloud Pub/Sub: Scalable Event Ingestion and Delivery
● Open APIs
● Global, fully managed event delivery
● Integrated with Cloud Dataflow for stream processing
BigQuery: Fully Managed SQL Data Warehouse
● OLAP analytics engine
● Scales from GB to PB with zero operations
Bigtable: Fully Managed NoSQL Database
● Fully managed NoSQL, wide-column database for TB to PB datasets
● Supports the open source HBase API and integrates with GCP data solutions
● Single indexed schema for thousands of columns, millions of rows
● Low latency and high throughput: millions of operations per second
[Diagram: user engagement analytics on GCP. App events and user interactions flow into Cloud Pub/Sub topics (with ACLs), through Cloud Dataflow streaming and batch pipelines (with open source orchestration and connectors), into the BigQuery engine for user engagement analytics and abuse detection, and into storage services (Cloud Storage, Cloud Datastore, Cloud SQL). Results feed business dashboards and data science tools used by business users, devs, and data scientists.]
Big Data Use Cases
To recap, the concept of big data covers a lot of ground and generally refers to the collection, storage, processing, analysis, and visualization of very large and very fast-moving datasets. Big data use cases span every industry as businesses increasingly look to differentiate their offerings by extracting insight from the data in their business.
The following slides describe some popular big data use cases.
Use Case: Extract, Transform, and Load (ETL)
Whenever you’re managing a massive amount of data, you’re
going to need to:
1. Extract a lot of raw data from disparate sources.
2. Transform that data into a form that can be used for your
business operations or analysis, perhaps by aggregating
or cleansing it.
3. Load that data into your data warehouse so you can use
it.
ETL generally refers to this process of moving and preparing data. An alternative is ELT, where you load the unprepared data into the data warehouse first and then transform it there.
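The three steps can be sketched in a few lines. This is a toy illustration, not any particular ETL product: the source records, field names, and warehouse table are all invented for the example.

```python
import sqlite3

# Hypothetical ETL sketch: extract raw order records, transform them
# (cleanse fields and aggregate per customer), and load the result into
# a warehouse table. All names and data are invented for illustration.

def extract():
    """Extract: pull raw rows from disparate sources (hard-coded here)."""
    return [
        {"customer": " alice ", "amount": "10.50"},
        {"customer": "BOB", "amount": "4.25"},
        {"customer": "alice", "amount": "7.00"},
    ]

def transform(rows):
    """Transform: cleanse the customer field and aggregate spend."""
    totals = {}
    for row in rows:
        name = row["customer"].strip().lower()   # cleanse
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals

def load(totals, conn):
    """Load: write the prepared data into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend (customer TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO customer_spend VALUES (?, ?)", totals.items())
    conn.commit()

conn = sqlite3.connect(":memory:")      # stand-in for a real warehouse
load(transform(extract()), conn)
print(dict(conn.execute("SELECT * FROM customer_spend")))
```

The ELT variant would simply run `load` on the raw rows first and do the cleansing and aggregation inside the warehouse, typically in SQL.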
Use Case: 360-degree Customer View
A 360-degree customer view is the attempt to get a
complete view of customers by combining data from
various touch points, such as marketing and the
purchasing process. Businesses use a 360-degree
customer view to drive better engagement, more revenue,
and long-term loyalty. It’s used by:
● Financial service businesses to determine the best
financial packages—insurance, investments, and so
on—to sell to specific customers.
● Retail businesses to determine the best times to
make special offers to maximize sales.
● Enterprise businesses to determine customer
retention and upsell strategies.
Use Case: Fraud Detection
Fraud detection is the process of identifying anomalies in patterns of behavior that signal
potential fraud. Today, fraud detection can involve analyzing large volumes of data, such as:
● Transactions
● Authorization information
● Buying patterns
For example, it’s used by:
● Credit card companies to prevent unauthorized purchases that don’t match a
customer’s profile.
● Financial service businesses to prevent illegal financial transactions.
● Technology businesses to prevent unauthorized access to products and services, such
as email.
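One simple version of "purchases that don't match a customer's profile" is an anomaly score against the customer's spending history. The sketch below is purely illustrative (real fraud systems combine many signals); the history values and threshold are invented.

```python
import statistics

# Hypothetical sketch: flag transactions whose amount deviates sharply
# from a customer's historical spending pattern, using a z-score.
# Real fraud detection combines many more signals than this.

def flag_anomalies(history, new_amounts, threshold=3.0):
    """Return amounts whose z-score against `history` exceeds `threshold`."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [a for a in new_amounts if abs(a - mean) / stdev > threshold]

history = [20.0, 25.0, 22.0, 30.0, 18.0, 24.0]   # typical purchases
suspicious = flag_anomalies(history, [23.0, 950.0])
print(suspicious)   # the $950 purchase stands out; the $23 one does not
```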
Use Case: Saving Lives
● Sequencing a human genome—all 3 billion “letters” that denote an individual's
unique DNA sequence—is providing information that’s improving scientists'
understanding of the genetic basis of many human diseases.
● Other large-scale projects, such as the 100,000 Genomes Project, are starting to
give some families a diagnosis for a child’s mysterious condition. Participants give
consent for their genome data to be linked to information about their medical
condition and health records. The medical and genomic data is shared with
researchers to improve knowledge of the causes, treatment, and care of diseases.
Other Use Cases
Source: A.T. Kearney Analysis
Common Big Data Terms

Common Big Data Terms - I
● Node: Usually a device on a network. A node on the Internet is anything that has an IP address.
● Distributed processing: The method of spreading data-processing capabilities across a set of networked computers.
● Batch processing: Processing of sets of data instead of single units to maximize efficiency.
● Stream processing: Continuous and automatic processing of data as it’s captured, in order to generate systematic output.
● Massively parallel processing (MPP): The use of a large number of distributed computers to perform a set of coordinated computations in parallel (simultaneously).
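To make the batch/stream distinction concrete, here is a minimal sketch (illustrative only, with invented sensor readings): the same aggregation computed once over a complete dataset versus incrementally as records arrive.

```python
# Illustrative contrast between batch and stream processing: both compute
# a running total of sensor readings, but batch waits for the full dataset
# while stream emits an updated result as each record arrives.

def batch_total(readings):
    """Batch: process the complete dataset in one pass."""
    return sum(readings)

def stream_totals(readings):
    """Stream: yield an updated total after every incoming record."""
    total = 0.0
    for r in readings:          # in a real system, an unbounded source
        total += r
        yield total             # output is available continuously

readings = [1.5, 2.0, 0.5]
print(batch_total(readings))          # one result at the end: 4.0
print(list(stream_totals(readings)))  # a result per record: [1.5, 3.5, 4.0]
```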
Common Big Data Terms - II
Data collection: The process of gathering data for analysis and evaluation from a variety of sources, which can be structured or unstructured in format. Also called data capture.
Data aggregation: The process of compiling information from multiple databases to create a combined dataset, usually for data processing, reporting, or analysis.
Data pipeline: Executable code defining a set of data-processing steps for transforming data.
Machine data: Data that records the activity and behavior of customers, users, transactions, applications, servers, networks, and mobile devices. It includes configurations, data from APIs, message queues, change events, the output of diagnostic commands, call detail records, sensor data from industrial systems, and more.
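The "executable code defining a set of data-processing steps" in the data pipeline definition can be sketched directly. This is a toy illustration; the step names and weather-station records are invented.

```python
# Hypothetical data pipeline sketch: a sequence of processing steps
# applied in order to each record. Steps and data are invented.

def parse(record):
    """Step 1: clean whitespace and split a CSV line into fields."""
    return record.strip().split(",")

def to_celsius(fields):
    """Step 2: convert the Fahrenheit reading to Celsius."""
    station, fahrenheit = fields
    return station, round((float(fahrenheit) - 32) * 5 / 9, 1)

def run_pipeline(records, steps):
    """Run every record through each step of the pipeline in order."""
    for record in records:
        for step in steps:
            record = step(record)
        yield record

raw = [" KSFO,68 ", "KJFK,41\n"]
print(list(run_pipeline(raw, [parse, to_celsius])))
```

Production pipeline frameworks add scheduling, parallelism, and fault tolerance around this same basic step-composition idea.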
Common Big Data Terms - III
Data science: The field of study of where information comes from, what it represents, and how it can be turned into valuable insights.
Data lake: A storage repository that holds a vast amount of raw data in its native format, including structured, semistructured, and unstructured data. Data is extracted from a data lake as needed and transformed into the format required by downstream processing.
Data monitoring: A business practice in which critical business data is routinely checked against quality control rules to make sure it is always high quality and meets previously established standards for formatting and consistency.
Data warehouse: A central repository of integrated data from one or more disparate sources, used for reporting and data analysis.
Vanilla: Refers to an installation that is straight from the source, contains no customization, and isn’t distributed by a third party.
Common Data Analytics Terms
Common Data Analytics Terms - I
Data mart: The part of a data warehouse that’s used to get data out to users, usually oriented to a specific business line or team.
Statistical computing: The interface between statistics and computer science. It’s the
area of computational science (or scientific computing) specific to the mathematical
science of statistics.
Web, mobile, and commerce analytics: The measurement, collection, analysis, and
reporting of web, mobile, or commerce data for purposes of understanding and
optimizing usage.
Online analytical processing (OLAP): An approach to answering analytical queries
swiftly as part of the broader category of business intelligence. Typical applications of
OLAP include business reporting for sales, marketing, management reporting, business
process management, budgeting and forecasting, financial reporting, and similar areas.
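As a toy illustration of the kind of query an OLAP system answers quickly, here is a fact table of sales rolled up along two dimensions. The regions, quarters, and figures are invented for the example.

```python
from collections import defaultdict

# Illustrative sketch of an OLAP-style roll-up: aggregate a fact table
# of sales along a chosen dimension. All data here is invented.

# Fact table rows: (region, quarter, revenue)
sales = [
    ("EMEA", "Q1", 100), ("EMEA", "Q2", 120),
    ("APAC", "Q1", 80),  ("APAC", "Q2", 95),
]

def roll_up(facts, dim):
    """Sum revenue grouped by one dimension (0 = region, 1 = quarter)."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dim]] += row[2]
    return dict(totals)

print(roll_up(sales, 0))   # by region: {'EMEA': 220, 'APAC': 175}
print(roll_up(sales, 1))   # by quarter: {'Q1': 180, 'Q2': 215}
```

Real OLAP engines precompute and index such aggregates across many dimensions so that analysts can slice and dice interactively.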
Common Data Analytics Terms - II
Real time: Near-zero latency, with access to data whenever it is required. This means business insights can be acted on as events happen rather than after the fact; analytics jobs used to take hours or days, often rendering critical business information stale by the time it arrived.
Speech and vision recognition, and natural language processing: Three core areas of machine learning that rely on huge amounts of training data and must process large amounts of data in real time.
Test Yourself: Can You Define Everything Shown?
[The full big data pipeline diagram from “Big Data Complexity Creates IT Opportunities” is repeated here: data sources (traditional and new), ingest, process, store, analyze, and visualize stages, with data management, governance, and security spanning all of them.]
Additional Resources
● Big data assets on the partner portal
● Google Cloud Platform big data one pager
● Big Data and the Creative Destruction of Today’s Business Models
● Public data sets for use by anyone for analyzing problems
● Video: What is Big Data? Can it help us solve some of society’s big challenges?
● Video: Deep Learning: Intelligence from Big Data
● Online Harvard course on Data Science
● Interesting big data infographic
● How big data is changing the database landscape
