Executive Summary Google BigQuery
• Google BigQuery is a cloud-based big data analytics web service for processing
very large read-only data sets.
• Developers can stream up to 100,000 rows of real-time data per second
into BigQuery and analyze it in near real time.
• BigQuery is Google's fully managed, NoOps, data analytics service.
• BigQuery bills on a per-project basis, so it’s usually easiest to create a single
project for your company that’s maintained by your billing department.
• Instead of using a job to load data into BigQuery, you can choose to stream your
data into BigQuery one record at a time by using the tabledata().insertAll()
method.
• There are also a variety of third-party tools that you can use to interact with
BigQuery, such as visualizing the data or loading the data.
A Big Data Solution
By Tanvi Parikh
Why should I use Google BigQuery … ?
• Collect, Ingest, Analyze all the large amounts of data your
organization/application/service generates.
• Process your big data in a scalable, cost-effective, and fast manner to meet your
product goals.
How will it really benefit me … ?
• BigQuery is Google's fully managed, NoOps, data analytics service.
• No infrastructure, database admin costs in a pay-as-you-go model.
• A myriad of features that can help your company at any stage (startup to Fortune
500).
Google BigQuery and its fit into Analytics Landscape
• MapReduce based analytics can be slow for ad-hoc queries.
• Managing data centers and tuning software takes time and money.
• Analytics tools should be services.
What makes Google BigQuery Special?
Flexible Data Ingestion
Load your data from Google Cloud Storage or
Google Cloud Datastore, or stream it into
BigQuery at 100,000 rows per second to enable
real-time analysis of your data.
Fast & Performant
BigQuery's columnar architecture is designed to
handle nested & repeated fields in a highly
performant manner, enabling super-fast queries
that help you save time and money.
Affordable Big Data
Loading and exporting data, and metadata
operations, are free of charge. Pay only for what
you store and what you query, and the first 1 TB
of data processed each month is free.
Ease of Collaboration
BigQuery enables you to access, save and share
complex datasets. You can also specify the
permissions collaborators have on each dataset.
Protected
BigQuery is built with a replicated storage
strategy. All data is encrypted both in-flight and at
rest. You can protect your data with strong role-
based ACLs that you configure and control.
Strong Partner Ecosystem
Partners have integrated BigQuery with some of
the industry-leading tools for loading,
transforming and visualizing data.
The pay-as-you-go (on-demand) pricing model
Resource: Pricing
Loading Data: Free
Exporting Data: Free
Storage: $0.020 per GB / month
Interactive Queries: $5 per TB processed
Batch Queries: $5 per TB processed
Streaming Inserts: $0.01 per 200 MB, with each row counted as a minimum of 1 KB
BigQuery uses a columnar data structure, which means that for a given query, you
are only charged for data processed in each column, not the entire table. The first
1 TB of data processed per month is at no charge.
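The per-column billing can be illustrated with a small back-of-the-envelope calculation. The prices below are the on-demand figures from the table above; the 2 TB column is a made-up example:

```python
TB = 10 ** 12             # BigQuery bills per terabyte of data processed
PRICE_PER_TB = 5.00       # USD, on-demand query pricing (see table above)
FREE_TB_PER_MONTH = 1.0   # first 1 TB processed each month is free

def query_cost(bytes_processed, free_tb_left=FREE_TB_PER_MONTH):
    """Cost of one query: only the columns the query reads are counted."""
    tb = bytes_processed / TB
    return max(0.0, tb - free_tb_left) * PRICE_PER_TB

# A query that scans a single 2 TB column of a much larger table is
# billed for 2 TB: 1 TB free tier + 1 TB at $5/TB.
print(query_cost(2 * TB))  # -> 5.0
```

Note that the cost depends only on the bytes in the columns the query touches, not the total table size.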
More on BigQuery
• Querying massive datasets can be time consuming and expensive without the
right hardware and infrastructure.
• Google BigQuery solves this problem by enabling super-fast, SQL-like queries
against append-only tables, using the processing power of Google's
infrastructure.
• BigQuery can be accessed from the web UI, the bq command-line tool, or the BigQuery
REST API
Uses of BigQuery
Analyzing query results using a visualization library such as Google Charts Tools API
Uses & Customer Case Studies of BigQuery
• Log Analysis - Making sense of computer generated records
• Retailer - Using data to forecast product sales
• Ads Targeting - Targeting proper customer sections
• Sensor Data - Collect and visualize ambient data
• Data Mashup - Query terabytes of heterogeneous data
Customer case studies (company logos omitted):
• Uses BigQuery to hone ad targeting and gain insights into their business.
• Builds dashboards using BigQuery to analyze booking and inventory data.
• Uses BigQuery to provide their customers ways to expand game engagement and
find new channels for monetization.
Basic Technical Details of BigQuery
BigQuery Fundamentals
• Projects are top-level containers in Google Cloud Platform.
 They store information about billing and authorized users, and
 They contain BigQuery data.
 Each project has a friendly name and a unique ID.
• BigQuery bills on a per-project basis, so it’s usually easiest to create a single project for your
company that’s maintained by your billing department.
BigQuery Fundamentals
• Tables contain your data in BigQuery, along with a corresponding table schema that describes
field names, types, and other information.
• BigQuery also supports views, virtual tables defined by a SQL query.
• BigQuery creates tables in one of the following ways:
 Loading data into a new table
 Running a query
 Copying a table
BigQuery Fundamentals
• Datasets allow you to organize and control access to your tables. Because tables are contained
in datasets, you'll need to create at least one dataset before loading data into BigQuery.
• You share BigQuery data with others by setting ACLs on datasets, not on the tables within them.
• Jobs are actions you construct and BigQuery executes on your behalf to load data, export data,
query data, or copy data.
• Since jobs can potentially take a long time to complete, they execute asynchronously and can be
polled for their status.
• BigQuery saves a history of all jobs associated with a project, accessible via the Google Developers
Console.
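Since jobs execute asynchronously, a client typically polls until the job reports the DONE state. A minimal Python sketch of such a polling loop (`get_status` is a hypothetical stand-in for a `bigquery.jobs.get()` call, simulated here with canned responses):

```python
import time

def wait_for_job(get_status, poll_interval=0.01, timeout=5.0):
    """Poll an asynchronous BigQuery job until it reports state DONE.

    get_status stands in for a call such as bigquery.jobs.get(...);
    it should return the job's status dict.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status.get("state") == "DONE":
            # A DONE job may still have failed; check for errorResult.
            if "errorResult" in status:
                raise RuntimeError(status["errorResult"]["message"])
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach DONE before the timeout")

# Simulated job that is PENDING, then RUNNING, then DONE:
states = iter([{"state": "PENDING"}, {"state": "RUNNING"}, {"state": "DONE"}])
print(wait_for_job(lambda: next(states)))  # -> {'state': 'DONE'}
```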
Interacting with BigQuery
There are three main ways to interact with BigQuery.
1. Loading and exporting data
Before you can query any data, you'll need to load it into BigQuery.
If you want to get the data back out of BigQuery, you can export the data.
2. Querying and viewing data
Once you load your data into BigQuery, there are a few ways to query or view the data in your tables:
• Querying data
 Calling the bigquery.jobs.query() method
 Calling the bigquery.jobs.insert() method with a query configuration
• Viewing data
 Calling the bigquery.tabledata.list() method
 Calling the bigquery.jobs.getQueryResults() method
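As a sketch, a request body for the bigquery.jobs.query() method might look like the following. The shakespeare sample table is a real public dataset; the project name is hypothetical:

```python
# Request body for bigquery.jobs.query() (REST API v2).
query_request = {
    "query": ("SELECT corpus, COUNT(word) AS words "
              "FROM publicdata:samples.shakespeare "
              "GROUP BY corpus LIMIT 5"),
    "maxResults": 5,       # cap the rows returned in the first response page
    "timeoutMs": 10000,    # wait up to 10 s before returning jobComplete=False
}
# With an authorized client this would be passed as the body of
# bigquery.jobs().query(projectId="my-project", body=query_request)
# ("my-project" is a placeholder for your own project ID).
```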
Interacting with BigQuery
3. Managing data
In addition to querying and viewing data, you can manage data in BigQuery by using functions that enable the
following tasks:
• Listing projects, jobs, tables and datasets
• Getting information about jobs, tables and datasets
• Updating or patching tables and datasets
• Deleting tables and datasets
Loading Data Into BigQuery
• Before you can query your data, you first need to load it into BigQuery. You can bulk load the data by using a job, or
stream records individually.
• Load jobs support three data sources:
1. Objects in Google Cloud Storage
2. Data sent with the job or streaming insert
3. A Google Cloud Datastore backup
• Loaded data can be added to a new table, appended to a table, or can overwrite a table. Data can be represented as
a flat or nested/repeated schema, as described in Data formats. Each individual load job can load data from multiple
sources, configured with the sourceUris property.
• It can be helpful to prepare the data before loading it into BigQuery, or transform the data if needed.
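As a hedged sketch, a load-job configuration passed to jobs.insert() could look like the following. The bucket, project, dataset, table, and schema names are all hypothetical:

```python
# Job body for bigquery.jobs.insert() describing a bulk load from
# Google Cloud Storage (all identifiers below are illustrative).
load_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/events-*.csv"],  # hypothetical bucket
            "destinationTable": {
                "projectId": "my-project",   # hypothetical project
                "datasetId": "analytics",
                "tableId": "events",
            },
            "sourceFormat": "CSV",
            "writeDisposition": "WRITE_APPEND",  # append rather than overwrite
            "schema": {"fields": [
                {"name": "event_time", "type": "TIMESTAMP"},
                {"name": "user_id", "type": "STRING"},
                {"name": "value", "type": "FLOAT"},
            ]},
        }
    }
}
```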
Loading Data into BigQuery
• ACCESS CONTROL –
• Loading data into BigQuery requires the following access levels:
• BigQuery: WRITE access for the dataset that contains the destination table.
• Google Cloud Storage: READ access for the object in Google Cloud Storage, if loading data from Google Cloud Storage.
• Google Cloud Datastore: READ access to the Cloud Datastore backup objects in Google Cloud Storage, if loading a Cloud Datastore backup.
• QUOTA POLICY –
• Daily limit : 1,000 load jobs per table per day (including failures), 10,000 load jobs per project per day (including failures)
• Maximum size per load: 5 TB across all input files for CSV and JSON.
• Maximum number of files per load: 100,000
• DATA AVAILABILITY –
• Warm-Up Period : The first time the data is streamed, the streamed data is inaccessible for 2 minutes. Also, after several hours of inactivity,
the warm-up period will occur again to make that data queryable.
• Data can take up to 90 minutes to become available for copy and export operations.
• DATA CONSISTENCY –
• Once you've called jobs.insert() to start a job, you can poll the job for its status by calling jobs.get().
• We recommend generating a job ID and passing it as jobReference.jobId when calling jobs.insert(). This approach is more robust to network
failure because the client can poll or retry on the known job ID.
• Note that calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and
at most one of those operations will succeed.
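The recommended pattern above can be sketched in Python. The helper name and project ID are hypothetical:

```python
import uuid

def with_job_reference(job_body, project_id="my-project"):
    """Attach a client-generated job ID so jobs.insert() can be retried
    safely: jobs.insert() is idempotent on the same job ID, so at most
    one of the retries will actually create the job."""
    body = dict(job_body)
    body["jobReference"] = {
        "projectId": project_id,
        "jobId": "job_" + uuid.uuid4().hex,
    }
    return body

retryable = with_job_reference({"configuration": {"load": {}}})
# On a network error, re-submit `retryable` unchanged; do NOT rebuild it,
# since that would generate a new job ID and defeat the idempotency.
```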
Loading Data into BigQuery
Third-Party Tools
ETL tools for loading data into BigQuery; visualization and business-intelligence tools for analyzing it
Loading Data into BigQuery
Loading Data Using the Web Browser
• Upload from local disk or from Cloud Storage
• Start the Web browser
• Select Dataset
• Create table and follow the wizard steps
Loading Data into BigQuery
Loading Data Using the BQ Tool
• If not specified, the default file format is CSV
(comma separated values)
• The files can also use newline delimited JSON
format
• Schema
 Either a filename or a comma-separated list
 of column_name:datatype pairs that
 describe the file format.
• Data source may be on local machine or on
Cloud Storage
"bq load" command
Syntax:
bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uri
table_schema
Preparing Data For BigQuery
Depending on your data's structure, you might need to prepare the data before loading it into BigQuery. Let's look at some datatypes
and formats BigQuery expects –
• DATA FORMATS – CSV, JSON
You can choose your format depending on the following factors:
1. Nested/repeated data: JSON; flat data: CSV
2. Newlines present in strings? JSON can be loaded much faster
• DATA FORMAT LIMITS –
• Row and cell size limits:
 CSV: 2 MB (row and cell size)
 JSON: 2 MB (row size)
• File size limits:
 CSV: 1 GB compressed; uncompressed, 4 GB with newlines in strings or 1 TB without
 JSON: 1 GB compressed; 1 TB uncompressed
Preparing Data For BigQuery
• DATATYPES – Your Data can include the following datatypes
• DATA ENCODING - BigQuery supports UTF-8 encoding for both nested/repeated and flat data, and supports ISO-8859-1 encoding
for flat data.
• DATA COMPRESSION - BigQuery can load uncompressed files significantly faster than compressed files due to parallel load
operations, but because uncompressed files are larger in size, using them can lead to bandwidth limitations and higher Google
Cloud Storage costs. In general, if bandwidth is limited, gzip compress files before uploading them to Google Cloud Storage. If
loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.
Data type Possible values
STRING 64 KB UTF-8 encoded string
INTEGER 64-bit signed integer
FLOAT Double-precision floating-point format
BOOLEAN •CSV format: true or false (case insensitive), or 1 or 0.
•JSON format: true or false (case insensitive)
RECORD A collection of one or more other fields
TIMESTAMP TIMESTAMP data types can be described in two ways:
UNIX timestamps or calendar datetimes.
BigQuery stores TIMESTAMP data internally as a UNIX timestamp with microsecond precision.
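The two TIMESTAMP representations describe the same instant; a small Python check (the timestamp value is just an example):

```python
import datetime

# The same instant written both ways, per the TIMESTAMP row above:
unix_ts = 1408452095  # seconds since the UNIX epoch
dt = datetime.datetime.fromtimestamp(unix_ts, datetime.timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S UTC"))  # -> 2014-08-19 12:41:35 UTC
```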
Preparing Data For BigQuery
• DENORMALIZING YOUR DATA –
• Normalization eliminates duplicate data from storage and provides the important benefit of consistency
when the data is regularly updated.
• In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying.
Some normalization is still possible with the nested/repeated functionality.
Let's take a simple example -- recording the cities that a list of people lived in during their lives.
(Diagrams contrast the relational schema, a flat schema, and a nested/repeated schema.)
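The example can be sketched as two Python row layouts (the names and values are made up):

```python
# Flat schema: one row per (person, city) pair, so person fields repeat.
flat_rows = [
    {"name": "Ada", "city": "London", "years": 5},
    {"name": "Ada", "city": "Paris",  "years": 2},
]

# Nested/repeated schema: one row per person, with cities as a
# repeated RECORD field.
nested_rows = [
    {"name": "Ada",
     "cities": [{"city": "London", "years": 5},
                {"city": "Paris",  "years": 2}]},
]

# Same information, but the nested form avoids duplicating "Ada" and
# keeps each person's full history in a single row.
assert len(flat_rows) == 2 and len(nested_rows) == 1
```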
Streaming Data into BigQuery
Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a
time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a
load job.
There are several important trade-offs to consider before choosing an approach.
• ACCESS CONTROL –
• Streaming data into BigQuery requires the following access level - WRITE access for the dataset that contains the
destination table.
• QUOTA POLICY –
• Maximum row size: 1 MB
• HTTP request size limit: 10 MB
• Maximum rows per second: 100,000 rows per second, per table. Exceeding this amount will cause quota_exceeded errors.
• Maximum rows per request: 500
• Maximum bytes per second: 100 MB per second, per table. Exceeding this amount will cause quota_exceeded errors.
• DATA AVAILABILITY –
• Warm-Up Period : The first time the data is streamed, the streamed data is inaccessible for 2 minutes. Also, after several
hours of inactivity, the warm-up period will occur again to make that data queryable.
• Data can take up to 90 minutes to become available for copy and export operations.
Streaming Data into BigQuery
• DATA CONSISTENCY–
• To help ensure data consistency, you can supply insertId for each inserted row.
• BigQuery remembers this ID for at least one minute.
• If you try to stream the same set of rows within that time period and the insertId property is set,
BigQuery uses the insertId property to de-duplicate your data on a best-effort basis.
• Leverage this de-duplication process when retrying inserts, as there's no way to determine the
state of a streaming insert under certain error conditions.
• For example, network errors between your system and BigQuery or internal errors within
BigQuery. In rare instances of regional data center unavailability, data duplication might occur
for the data hosted in the region experiencing the disruption. New row insertions would be
routed to data centers in another region, but de-duplication with the unavailable data would
not be possible.
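A minimal Python sketch of building such a request body (the helper and row field names are hypothetical; `insertId` and `json` are the documented per-row request fields):

```python
import uuid

def insert_all_body(rows):
    """Build a tabledata().insertAll() request body, giving each row a
    client-generated insertId so retried inserts can be de-duplicated."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [{"insertId": uuid.uuid4().hex, "json": row} for row in rows],
    }

body = insert_all_body([{"event": "click", "user_id": "u1"}])
# Retrying with the SAME body (same insertIds) is safe; rebuilding the
# body would generate fresh insertIds and could duplicate rows.
```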
Streaming Data into BigQuery - Examples
1. HIGH VOLUME EVENT LOGGING -
If you have an app that collects a large amount of data in real-time, streaming inserts can be a good choice.
Generally, these types of apps have the following criteria:
Not transactional. High volume, continuously appended rows. The app can tolerate a rare possibility that
duplication might occur or that data might be temporarily unavailable.
Aggregate analysis. Queries generally are performed for trend analysis, as opposed to single or narrow
record selection.
One example of high volume event logging is event tracking. Suppose you have a mobile app that tracks events. Your
app, or mobile servers, could independently record user interactions or system errors and stream them into BigQuery.
You could analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor
error conditions in real-time.
Accessing BigQuery
• BigQuery Web browser
• Imports/exports data, runs
queries
• BQ command line tool
• Performs operations from the
command line
• Service API
• RESTful API to access BigQuery
programmatically
• Requires authorization by OAuth2
• Google client libraries for Python,
Java, JavaScript, PHP, ...
(Diagram: the web tool, the BQ command-line tool, and the Service API all issue requests to BigQuery and display the results.)
BigQuery Best Practices
CSV/JSON must be split into chunks less than 1TB
• "split" command with --line-bytes option
• Split to smaller files
 Easier error recovery
 Smaller data units (day or month instead
of year)
• Uploading to Cloud Storage is recommended
• Split Tables by Dates
 Minimize cost of data scanned
 Minimize query time
• Upload Multiple Files to Cloud Storage
 Allows parallel upload into BigQuery
• Denormalize your data
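The split step can be sketched with GNU coreutils. The chunk size here is tiny for illustration; in practice you would use a value such as --line-bytes=1G:

```shell
# Generate a sample CSV, then chunk it before uploading to Cloud Storage.
# --line-bytes keeps each chunk under the size limit without breaking rows.
seq 1 100000 > rows.csv
split --line-bytes=100k rows.csv rows_part_
# Recombining the chunks reproduces the original file exactly:
cat rows_part_* | cmp - rows.csv && echo "chunks verified"
```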
References for the Presentation
• https://cloud.google.com/bigquery/
• https://cloud.google.com/bigquery/what-is-bigquery
• https://cloud.google.com/bigquery/docs/reference/v2/
• https://en.wikipedia.org/wiki/BigQuery
enxupq
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 

Recently uploaded (20)

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 

BigQuery

  • 1. Executive Summary: Google BigQuery
  • Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets.
  • Developers can stream up to 100,000 rows of data per second into BigQuery and analyze it in near real time.
  • BigQuery is Google's fully managed, NoOps data analytics service.
  • BigQuery bills on a per-project basis, so it's usually easiest to create a single project for your company that's maintained by your billing department.
  • Instead of using a job to load data into BigQuery, you can stream your data in one record at a time by using the tabledata().insertAll() method.
  • A variety of third-party tools can also interact with BigQuery, for example to visualize or load data.
  • 2. A Big Data Solution. By Tanvi Parikh.
  • 3. Why should I use Google BigQuery?
  • Collect, ingest, and analyze all the data your organization, application, or service generates.
  • Process your big data in a scalable, cost-effective, fast manner to advance product goals.
  How will it really benefit me?
  • BigQuery is Google's fully managed, NoOps data analytics service.
  • No infrastructure or database administration costs, in a pay-as-you-go model.
  • A myriad of features that can help your company at any stage (startup to Fortune 500).
  • 4. Google BigQuery and its fit in the analytics landscape
  • MapReduce-based analytics can be slow for ad-hoc queries.
  • Managing data centers and tuning software takes time and money.
  • Analytics tools should be services.
  • 5. What makes Google BigQuery special?
  • Flexible data ingestion: load your data from Google Cloud Storage or Google Cloud Datastore, or stream it into BigQuery at 100,000 rows per second to enable real-time analysis.
  • Fast and performant: BigQuery's columnar architecture is designed to handle nested and repeated fields in a highly performant manner, enabling super-fast queries that help you save time and money.
  • Affordable big data: loading and exporting data, and metadata operations, are free of charge. Pay only for what you store and what you query; the first 1 TB of data processed each month is free.
  • Ease of collaboration: BigQuery enables you to access, save, and share complex datasets, and to specify what permissions other users have on each dataset.
  • Protected: BigQuery is built on a replicated storage strategy. All data is encrypted both in flight and at rest, and you can protect your data with strong role-based ACLs that you configure and control.
  • Strong partner ecosystem: partners have integrated BigQuery with some of the industry-leading tools for loading, transforming, and visualizing data.
  • 6. The pay-as-you-go pricing model (on-demand pricing)
  • Loading data: free
  • Exporting data: free
  • Storage: $0.020 per GB per month
  • Interactive queries: $5 per TB processed
  • Batch queries: $5 per TB processed
  • Streaming inserts: $0.01 per 200 MB (each row counted as a minimum of 1 KB)
  • BigQuery uses a columnar data structure, which means that for a given query you are charged only for the data processed in the columns you reference, not the entire table. The first 1 TB of data processed per month is free.
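A rough monthly cost estimate under this model can be sketched as follows; the rates are hard-coded from the table above (historical, so check current pricing), and the workload numbers are made up:

```python
# Rough monthly cost estimate for BigQuery's on-demand pricing model.
# Rates come from the pricing table above (historical figures).

STORAGE_PER_GB = 0.020   # $ per GB stored per month
QUERY_PER_TB = 5.00      # $ per TB processed by queries
FREE_QUERY_TB = 1.0      # first TB processed each month is free

def monthly_cost(stored_gb, queried_tb):
    """Storage plus query cost; loading and exporting are free."""
    storage = stored_gb * STORAGE_PER_GB
    billable_tb = max(0.0, queried_tb - FREE_QUERY_TB)
    return storage + billable_tb * QUERY_PER_TB

# 500 GB stored and 3 TB scanned: $10 storage + $10 for the 2 billable TB.
print(monthly_cost(500, 3))  # 20.0
```

Note that because billing is per column scanned, selecting fewer columns directly lowers the `queried_tb` term.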
  • 7. More on BigQuery
  • Querying massive datasets can be time-consuming and expensive without the right hardware and infrastructure.
  • Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure.
  • BigQuery can be accessed from the web UI, the command-line tools, or the BigQuery REST API.
  • 8. Uses of BigQuery
  • Analyzing query results using a visualization library such as the Google Charts Tools API.
  • 9. Uses and customer case studies of BigQuery
  • Log analysis: making sense of computer-generated records.
  • Retail: using data to forecast product sales.
  • Ads targeting: targeting the proper customer segments.
  • Sensor data: collecting and visualizing ambient data.
  • Data mashup: querying terabytes of heterogeneous data.
  • Case studies (company logos lost in extraction): one customer uses BigQuery to hone ad targeting and gain insights into their business; another builds dashboards on BigQuery to analyze booking and inventory data; a third uses BigQuery to give their customers ways to expand game engagement and find new channels for monetization.
  • 11. BigQuery fundamentals: projects
  • Projects are top-level containers in Google Cloud Platform. They store information about billing and authorized users, and they contain BigQuery data. Each project has a friendly name and a unique ID.
  • BigQuery bills on a per-project basis, so it's usually easiest to create a single project for your company that's maintained by your billing department.
  • 12. BigQuery fundamentals: tables
  • Tables contain your data in BigQuery, along with a corresponding table schema that describes field names, types, and other information.
  • BigQuery also supports views: virtual tables defined by a SQL query.
  • BigQuery creates tables in one of the following ways: loading data into a new table, running a query, or copying a table.
  • 13. BigQuery fundamentals: datasets and jobs
  • Datasets allow you to organize and control access to your tables. Because tables are contained in datasets, you'll need to create at least one dataset before loading data into BigQuery.
  • You share BigQuery data with others by setting ACLs on datasets, not on the tables within them.
  • Jobs are actions that you construct and BigQuery executes on your behalf to load, export, query, or copy data.
  • Since jobs can potentially take a long time to complete, they execute asynchronously and can be polled for their status.
  • BigQuery saves a history of all jobs associated with a project, accessible via the Google Developers Console.
  • 14. Interacting with BigQuery
  There are three main ways to interact with BigQuery.
  1. Loading and exporting data. Before you can query any data, you'll need to load it into BigQuery. If you want to get the data back out of BigQuery, you can export it.
  2. Querying and viewing data. Once you load your data into BigQuery, there are a few ways to query or view the data in your tables:
  • Querying data: calling the bigquery.jobs.query() method, or calling the bigquery.jobs.insert() method with a query configuration.
  • Viewing data: calling the bigquery.tabledata.list() method, or calling the bigquery.jobs.getQueryResults() method.
  • 15. Interacting with BigQuery (continued)
  3. Managing data. In addition to querying and viewing data, you can manage data in BigQuery with functions for the following tasks:
  • Listing projects, jobs, tables, and datasets
  • Getting information about jobs, tables, and datasets
  • Updating or patching tables and datasets
  • Deleting tables and datasets
  • 16. Loading data into BigQuery
  • Before you can query your data, you first need to load it into BigQuery. You can bulk load the data by using a job, or stream records individually.
  • Load jobs support three data sources: objects in Google Cloud Storage; data sent with the job or streaming insert; and a Google Cloud Datastore backup.
  • Loaded data can be added to a new table, appended to an existing table, or used to overwrite a table. Data can be represented as a flat or nested/repeated schema, as described in Data formats. Each individual load job can load data from multiple sources, configured with the sourceUris property.
  • It can be helpful to prepare the data before loading it into BigQuery, or to transform it if needed.
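As a sketch of the sourceUris idea, a load job in the REST API (v2) is described by a JSON request body like the one below; the project, dataset, table, and bucket names are hypothetical:

```python
# Sketch of a BigQuery v2 load-job request body (the JSON a client would
# POST to the jobs collection). All names here are illustrative.
load_job = {
    "configuration": {
        "load": {
            "sourceUris": [
                # A single load job can read multiple Cloud Storage objects.
                "gs://example-bucket/logs/part-0.csv",
                "gs://example-bucket/logs/part-1.csv",
            ],
            "sourceFormat": "CSV",
            "destinationTable": {
                "projectId": "example-project",
                "datasetId": "example_dataset",
                "tableId": "logs",
            },
            # Append to the destination table rather than overwrite it.
            "writeDisposition": "WRITE_APPEND",
        }
    }
}
print(len(load_job["configuration"]["load"]["sourceUris"]))  # 2
```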
  • 17. Loading data into BigQuery
  Access control: loading data into BigQuery requires the following access levels:
  • BigQuery: WRITE access for the dataset that contains the destination table.
  • Google Cloud Storage: READ access for the object in Google Cloud Storage, if loading data from Google Cloud Storage.
  • Google Cloud Datastore: READ access to the Cloud Datastore backup objects in Google Cloud Storage.
  Quota policy:
  • Daily limit: 1,000 load jobs per table per day (including failures); 10,000 load jobs per project per day (including failures).
  • Maximum size per load: 5 TB across all input files for CSV and JSON.
  • Maximum number of files per load: 100,000.
  Data availability:
  • Warm-up period: the first time data is streamed, the streamed data is inaccessible for 2 minutes. After several hours of inactivity, the warm-up period occurs again before that data becomes queryable.
  • Data can take up to 90 minutes to become available for copy and export operations.
  Job consistency:
  • Once you've called jobs.insert() to start a job, you can poll the job for its status by calling jobs.get().
  • We recommend generating a job ID and passing it as jobReference.jobId when calling jobs.insert(). This approach is more robust to network failure because the client can poll or retry on the known job ID.
  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
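The job-ID recommendation above can be sketched like this; `insert_job` is a hypothetical stand-in for the real jobs.insert() call, injected so the retry logic is visible on its own:

```python
import uuid

def start_load_job(insert_job, project_id, config, max_attempts=3):
    """Retry jobs.insert() safely by fixing the job ID up front.

    insert_job is a hypothetical callable wrapping the BigQuery
    jobs.insert() API. Because a given job ID is idempotent, retrying
    the same ID after a network failure creates at most one job.
    """
    job_id = "load_%s" % uuid.uuid4().hex  # client-generated, stable across retries
    body = {"jobReference": {"projectId": project_id, "jobId": job_id},
            "configuration": config}
    for attempt in range(max_attempts):
        try:
            insert_job(body)
            break
        except IOError:  # e.g. a network blip: retry the same job ID
            if attempt == max_attempts - 1:
                raise
    return job_id

# With a fake insert_job that fails once, the same job ID is retried:
calls = []
def flaky(body):
    calls.append(body["jobReference"]["jobId"])
    if len(calls) == 1:
        raise IOError("network blip")

jid = start_load_job(flaky, "example-project", {"load": {}})
print(calls[0] == calls[1] == jid)  # True
```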
  • 18. Loading data into BigQuery: third-party tools
  • ETL tools for loading data into BigQuery.
  • Visualization and business intelligence tools.
  • 19. Loading data into BigQuery: using the web browser
  • Upload from local disk or from Cloud Storage.
  • Start the web browser.
  • Select a dataset.
  • Create a table and follow the wizard steps.
  • 20. Loading data into BigQuery: using the bq tool
  • If not specified, the default file format is CSV (comma-separated values).
  • Files can also use newline-delimited JSON format.
  • Schema: either a filename or a comma-separated list of column_name:datatype pairs that describe the file format.
  • The data source may be on the local machine or on Cloud Storage.
  • "bq load" command syntax:
    bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uri table_schema
  • 21. Preparing data for BigQuery
  Depending on your data's structure, you might need to prepare the data before loading it into BigQuery. Let's look at the data types and formats BigQuery expects.
  • Data formats: CSV and JSON. Choose your format based on the following factors:
  1. Nested data: JSON; flat data: CSV.
  2. Newlines present in strings: JSON can be loaded much faster.
  • Data format limits: CSV: 2 MB (row and cell size); JSON: 2 MB (row size).
  • File size limits:
    CSV: 1 GB compressed; uncompressed, 4 GB with newlines in strings, 1 TB without.
    JSON: 1 GB compressed; 1 TB uncompressed.
  • 22. Preparing data for BigQuery
  • Data types: your data can include the following types:
    STRING: 64 KB UTF-8 encoded string.
    INTEGER: 64-bit signed integer.
    FLOAT: double-precision floating-point format.
    BOOLEAN: CSV format: true or false (case insensitive), or 1 or 0; JSON format: true or false (case insensitive).
    RECORD: a collection of one or more other fields.
    TIMESTAMP: can be described in two ways: UNIX timestamps or calendar datetimes. BigQuery stores TIMESTAMP data internally as a UNIX timestamp with microsecond precision.
  • Data encoding: BigQuery supports UTF-8 encoding for both nested/repeated and flat data, and ISO-8859-1 encoding for flat data.
  • Data compression: BigQuery can load uncompressed files significantly faster than compressed files due to parallel load operations, but because uncompressed files are larger, using them can lead to bandwidth limitations and higher Google Cloud Storage costs. In general, if bandwidth is limited, gzip-compress files before uploading them to Google Cloud Storage. If loading speed is important to your app and you have a lot of bandwidth, leave files uncompressed.
  • 23. Preparing data for BigQuery: denormalizing your data
  • Normalization eliminates duplicate data from being stored and provides the important benefit of consistency when regular updates are being made to the data.
  • In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying. Some normalization is still possible with the nested/repeated functionality.
  • A simple example: recording the cities that a list of people lived in during their lives can be modeled as a relational database, as a flat schema, or as a nested/repeated schema (the slide's schema diagrams were lost in extraction).
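For the people-and-cities example, a flat schema repeats the person for every city, while a nested/repeated schema keeps one row per person with a repeated RECORD field. A sketch, with illustrative field names:

```python
# Flat schema: one row per (person, city) pair; the person's name repeats.
flat_rows = [
    {"name": "Ada", "city": "London", "years": 10},
    {"name": "Ada", "city": "Paris", "years": 3},
]

# Nested/repeated schema: one row per person, with a repeated RECORD field
# holding the cities. This is the shape a newline-delimited JSON load with
# nested/repeated support would use.
nested_row = {
    "name": "Ada",
    "cities_lived": [
        {"city": "London", "years": 10},
        {"city": "Paris", "years": 3},
    ],
}

# The nested row carries the same facts without repeating the name.
print(len(nested_row["cities_lived"]))  # 2
```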
  • 24. Streaming data into BigQuery
  Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job. There are several important trade-offs to consider before choosing an approach.
  • Access control: streaming data into BigQuery requires WRITE access for the dataset that contains the destination table.
  • Quota policy:
  • Maximum row size: 1 MB.
  • HTTP request size limit: 10 MB.
  • Maximum rows per second: 100,000 rows per second, per table. Exceeding this amount causes quota_exceeded errors.
  • Maximum rows per request: 500.
  • Maximum bytes per second: 100 MB per second, per table. Exceeding this amount causes quota_exceeded errors.
  • Data availability:
  • Warm-up period: the first time data is streamed, the streamed data is inaccessible for 2 minutes. After several hours of inactivity, the warm-up period occurs again before that data becomes queryable.
  • Data can take up to 90 minutes to become available for copy and export operations.
  • 25. Streaming data into BigQuery
  • Data consistency:
  • To help ensure data consistency, you can supply an insertId for each inserted row.
  • BigQuery remembers this ID for at least one minute.
  • If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best-effort basis.
  • Leverage the de-duplication process when retrying inserts, since there is no way to determine the state of a streaming insert under certain error conditions, for example network errors between your system and BigQuery, or internal errors within BigQuery.
  • In rare instances of regional data center unavailability, data duplication might occur for the data hosted in the region experiencing the disruption. New row insertions would be routed to data centers in another region, but de-duplication with the unavailable data would not be possible.
  • 26. Streaming data into BigQuery: examples
  1. High-volume event logging. If you have an app that collects a large amount of data in real time, streaming inserts can be a good choice. Generally, these types of apps have the following criteria:
  • Not transactional: high-volume, continuously appended rows. The app can tolerate a rare possibility of duplication, or of data being temporarily unavailable.
  • Aggregate analysis: queries are generally performed for trend analysis, as opposed to single or narrow record selection.
  One example of high-volume event logging is event tracking. Suppose you have a mobile app that tracks events. Your app, or your mobile servers, could independently record user interactions or system errors and stream them into BigQuery. You could analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor error conditions in real time.
  • 27. Accessing BigQuery
  • BigQuery web browser: imports/exports data, runs queries.
  • bq command-line tool: performs operations from the command line.
  • Service API: a RESTful API to access BigQuery programmatically. Requires authorization by OAuth2; Google provides client libraries for Python, Java, JavaScript, PHP, and more.
  (Diagram: the web tool, bq tool, and service API all query BigQuery and display results.)
  • 28. BigQuery best practices
  • CSV/JSON files must be split into chunks of less than 1 TB:
  • Use the "split" command with the --line-bytes option.
  • Splitting into smaller files makes error recovery easier, as does splitting into smaller data units (day or month instead of year).
  • Uploading to Cloud Storage first is recommended.
  • Split tables by dates to minimize both the cost of data scanned and query time.
  • Upload multiple files to Cloud Storage to allow parallel loading into BigQuery.
  • Denormalize your data.
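The split step above can be sketched with GNU coreutils; a tiny sample file and chunk size stand in for a real multi-terabyte export:

```shell
# Create a small sample CSV (stand-in for a large export).
printf 'alice,1\nbob,2\ncarol,3\n' > sample.csv

# Split into chunks of at most 16 bytes of whole lines each.
# --line-bytes never breaks a line across chunks, so every chunk stays
# a valid CSV fragment. For real data you would use e.g. --line-bytes=1G.
split --line-bytes=16 --additional-suffix=.csv sample.csv chunk_

ls chunk_*.csv
```

This assumes GNU split; the `--additional-suffix` option is a GNU extension (coreutils 8.16+).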
  • 29. References for the presentation
  • https://cloud.google.com/bigquery/
  • https://cloud.google.com/bigquery/what-is-bigquery
  • https://cloud.google.com/bigquery/docs/reference/v2/
  • https://en.wikipedia.org/wiki/BigQuery