The 'macro view' on BigQuery:
We start with an overview and some typical uses, then move on to project hierarchy, access control, and security.
At the end we touch on tools and demos.
An introduction to our data warehouse solution, BigQuery.
The Google Cloud Platform products are based on the internal systems that power Google AdWords, Search, YouTube, and Google's leading research in the field of real-time data analysis.
You can get access to our free trial ($300 of credit for 60 days) through google.com/cloud
Introduction to Google BigQuery. Slides used at the first GDG Cloud meetup in Brussels, about big data on Google Cloud Platform. (http://www.meetup.com/GDG-Cloud-Belgium/events/228206131)
A short introduction to BigQuery. With this presentation you'll quickly discover:
How to load data into BigQuery
How to build dashboards using BigQuery
How to work with BigQuery
and, last but not least, some best practices
We hope you'll enjoy this presentation and that it helps you start exploring this wonderful solution. Don't hesitate to send us your feedback or questions.
Basic concepts, best practices, and pricing of BigQuery, the petabyte-scale analytics data platform from Google Cloud Platform. There is a lot to learn about this tool and its features, such as BI Engine and AI Platform.
In this webinar you'll learn about the best practices for Google BigQuery—and how Matillion ETL makes loading your data faster and easier. Find out from our experts how to leverage one of the largest, fastest, and most capable cloud data warehouses to improve your business and save money.
In this webinar:
- Discover how to work fast and efficiently with Google BigQuery
- Find out the best ways to monitor and control costs
- Learn to leverage Matillion ETL and optimize Google BigQuery
- Get tips and tricks for better performance
In this presentation we go through the differences and similarities between Redshift and BigQuery. It was presented at the Athens Big Data meetup in May 2017.
Google BigQuery for Everyday Developer (Márton Kodok)
IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
A. Every scientist who needs big data analytics to save millions of lives should have that power
Legacy systems don’t provide the power.
B. The simple fact is that you are brilliant but your brilliant ideas require complex analytics.
Traditional solutions are not applicable.
The Plan: have oversight over developments as they happen.
Goal: Store everything accessible by SQL immediately.
What is BigQuery?
Analytics-as-a-Service - Data Warehouse in the Cloud
Fully-Managed by Google (US or EU zone)
Scales into Petabytes
Ridiculously fast
Decent pricing (queries: $5/TB processed, storage: $20/TB/month) *October 2016 pricing
100,000 rows/sec Streaming API
Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
Familiar DB Structure (table, views, record, nested, JSON)
Convenience of SQL + Javascript UDF (User Defined Functions)
Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
Client libraries available in YFL (your favorite languages)
Our benefits
no provisioning/deploy
no running out of resources
no more focus on large scale execution plan
no need to re-implement tricky concepts
(time windows / join streams)
pay only for the columns referenced in your queries
run raw ad-hoc queries (either by analysts/sales or Devs)
no more throwing away, expiring, or aggregating old data.
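The pay-per-column point above can be made concrete with a small back-of-the-envelope calculator. This is only a sketch assuming the October 2016 rate of $5 per TB quoted in this deck; the column sizes are invented for illustration.

```python
# Hypothetical sketch: estimating on-demand query cost, assuming the
# $5/TB rate quoted above. Column sizes below are made up.

TB = 1024 ** 4        # bytes per tebibyte
PRICE_PER_TB = 5.0    # USD per TB processed (Oct 2016 on-demand rate)

def query_cost(scanned_column_bytes):
    """Cost of a query: only the columns it references are billed."""
    billed = sum(scanned_column_bytes)
    return billed / TB * PRICE_PER_TB

# A wide table where the query touches only two columns:
cols = {"user_id": 0.5 * TB, "event_ts": 0.25 * TB}
print(round(query_cost(cols.values()), 2))  # 3.75
```

Because BigQuery is columnar, a `SELECT *` over the same table would bill every column; selecting only the fields you need is the single simplest cost lever.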
My Talk at GCPUG-Taiwan on 2015/5/8.
You use BigQuery with SQL, but BigQuery's internals are very different from the traditional relational database systems you may be familiar with.
One way to understand how BigQuery works is to look at what you pay for. Knowing how to save money while using BigQuery means knowing, to some extent, how BigQuery works.
In this session, let’s talk about practical knowledge (saving money) and exciting technology (how BigQuery works)!
BigQuery best practices and recommendations to reduce costs with BI Engine, S... (Márton Kodok)
Best practices and recommendations for tuning BI Engine for your existing BigQuery workloads, for cheaper and faster queries. Learn how we at REEA orchestrate BI Engine reservations on a 5 TB dataset (small by BigQuery standards, but with big cost savings and accelerated queries). Many presentations target big enterprises; here we showcase how our queries perform better at lower cost. We address the top considerations for when to turn on BI Engine, how to use cloud orchestration to make this an automatic process, and how, combined with BigQuery and Data Studio, this can save precious development time, lower bills, and speed up queries.
Discover BigQuery ML, build your own CREATE MODEL statement (Márton Kodok)
With BigQuery ML, you can build machine learning models without leaving the database environment, training them on massive datasets. In this demo session we demonstrate common marketing machine learning use cases: how to build, train, evaluate, and predict with your own scalable machine learning models using SQL in Google BigQuery, addressing the following use cases:
- Customer segmentation + product cross-sell recommendation
- Conversion/purchase prediction
- Inference with the other 20+ built-in models
The audience will get first-hand experience writing CREATE MODEL SQL syntax to build machine learning models such as:
- Multiclass logistic regression for classification
- K-means clustering
- Matrix factorization
- ARIMA time series predictions
...and more. Models are trained and accessed in BigQuery using SQL, a language data analysts know. This enables business decision-making through predictive analytics across the organization without leaving the query editor. In the end, the audience will learn how everyday developers can build, train, and run their own machine learning models straight from the database query editor by issuing CREATE MODEL statements.
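As a rough illustration of the CREATE MODEL syntax such a session covers, the sketch below composes a k-means segmentation statement as a string. The dataset, table, and column names are hypothetical; the `model_type` and `num_clusters` options follow BigQuery ML's documented k-means syntax.

```python
# Sketch: composing a BigQuery ML CREATE MODEL statement for k-means
# customer segmentation. All identifiers here are hypothetical.

def create_kmeans_model_sql(model_id, source_table, feature_cols, num_clusters=4):
    """Return a CREATE MODEL statement for k-means clustering in BigQuery ML."""
    cols = ", ".join(feature_cols)
    return (
        f"CREATE OR REPLACE MODEL `{model_id}`\n"
        f"OPTIONS(model_type='kmeans', num_clusters={num_clusters}) AS\n"
        f"SELECT {cols} FROM `{source_table}`"
    )

sql = create_kmeans_model_sql(
    "shop.customer_segments",   # hypothetical model id
    "shop.orders",              # hypothetical source table
    ["total_spend", "order_count", "days_since_last_order"],
)
print(sql)
```

Once created, such a model would be queried with `ML.PREDICT` in ordinary SQL, which is the point of the talk: the whole train/predict loop stays inside the query editor.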
As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up to date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the video - you can click on the below link to go through the video after the schedule.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
Big Query - Utilizing Google Data Warehouse for Media Analytics (hafeeznazri)
This topic covers an intermediate understanding of Google BigQuery and how Media Prima Digital utilizes BigQuery as its production data warehouse.
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
In this presentation, Kaz Ohta, Kiyoto Tamura, and Ankush Rustagi from Treasure Data describe the company's Cloud Data Warehouse service.
"The Treasure Data Cloud Data Warehouse service enables companies to get big data analytics running in days not months without specialist IT resources and for a tenth the cost of other alternatives. Traditional data warehousing solutions - even modern alternatives such as Hadoop - are too expensive, complex and take too long for many companies to implement, so the idea of quickly launching a data warehouse service that uses the power and economics of the Cloud for companies of any size, opens up a huge potential market."
Learn more at: http://treasure-data.com * Watch the presentation video: http://inside-bigdata.com/?p=3531
How we configure Google Data Studio to enhance reporting for Google Analytics, Search Console, pay-per-click, Google AdWords, YouTube marketing, and more.
Building robust CDC pipeline with Apache Hudi and Debezium (Tathastu.ai)
We cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We also cover the architecture and the challenges we faced while running this system in production. Finally, we conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
Implementing Google BigQuery automation using Google Analytics data (Countants)
The increasing value of big data analytics for business presents a lot of use cases for BigQuery technology. Through Google Analytics to BigQuery automation, data analysts can save time as well as extract deeper insights from the latest Google Analytics data.
[Webinar] Getting Started with BigQuery: Basics, Its Applications & Use Cases (Tatvic Analytics)
This webinar aims to provide the BigQuery product walkthrough right from the basics. Our core focus will be on the use cases and applications that help to gain additional customer insights from the data integrated within BigQuery.
BigQuery is equipped with the ability to crunch TBs of data in seconds while ensuring scalability and speed. It also enables us to perform advanced statistical analysis by providing unsampled raw hit level analytics data.
Budapest Data Forum 2017 - BigQuery, Looker And Big Data Analytics At Petabyt... (Rittman Analytics)
As big data and data warehousing scale up and move into the cloud, they're increasingly likely to be delivered as services using distributed cloud query engines such as Google BigQuery, loaded using streaming data pipelines, and queried using BI tools such as Looker. In this session the presenter walks through how data modelling and query processing work when storing petabytes of customer event-level activity in a distributed data store and query engine like BigQuery; how data ingestion and processing work in an always-on streaming data pipeline; how additional services such as the Google Natural Language API can be used to classify sentiment and extract entity nouns from incoming unstructured data; and how BI tools such as Looker and Google Data Studio bring data discovery and business metadata layers to cloud big data analytics.
This is a run-through at a 200 level of the Microsoft Azure Big Data Analytics for the Cloud data platform based on the Cortana Intelligence Suite offerings.
Google BigQuery is the future of Analytics! (Google Developer Conference) (Rasel Rana)
Google Developer Group (GDG) Sonargaon is a community-based focus group for developers on Google and related technologies. I covered a topic on big data and BigQuery, which is the future of analytics.
Supercharge your data analytics with BigQuery (Márton Kodok)
Powering interactive data analysis requires massive architecture and the know-how to build a fast real-time computing system. BigQuery solves this problem by enabling super-fast, SQL-like queries against petabytes of data using the processing power of Google's infrastructure. We will cover its core features: creating tables, columns, and views; working with partitions; clustering for cost optimization; streaming inserts; user-defined functions; and several use cases for the everyday developer: funnel analytics, behavioral analytics, and exploring unstructured data.
The other part will be about BigQuery ML, which enables users to create and execute machine learning models in BigQuery using standard SQL queries. BigQuery ML democratizes machine learning by enabling SQL practitioners to build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.
These are our contributions to the data science projects developed in our startup. They are part of partner trainings and of the in-house design, development, and testing of course material and concepts in data science and engineering. The material covers data ingestion, data wrangling, feature engineering, data analysis, data storage, data extraction, querying data, and formatting and visualizing data for various dashboards. Data is prepared for accurate ML model predictions and generative AI apps.
This is our project work at our startup for data science, part of our internal training, focused on data management for AI, ML, and generative AI apps.
How we can use the cloud computing provided by Google to access our data everywhere, how to connect to the BigQuery service using Python, and then how to connect BigQuery to Power BI to create an interactive dashboard.
New Innovations in Information Management for Big Data - Smarter Business 2013 (IBM Sverige)
Big data has changed the IT landscape. Learn how your existing IIG investment, combined with our latest innovations in integration and governance, is a springboard to success with big data use cases that unlock valuable new insights. Presenter: David Corrigan, Big Data Specialist, IBM
Webinar: Faster Big Data Analytics with MongoDB (MongoDB)
Learn how to leverage MongoDB and Big Data technologies to derive rich business insight and build high performance business intelligence platforms. This presentation includes:
- Uncovering Opportunities with Big Data analytics
- Challenges of real-time data processing
- Best practices for performance optimization
- Real world case study
This presentation was given in partnership with CIGNEX Datamatics.
QuerySurge Slide Deck for Big Data Testing Webinar (RTTS)
This is a slide deck from QuerySurge's Big Data Testing webinar.
Learn why testing is pivotal to the success of your big data strategy.
Learn more at www.querysurge.com
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data, Hadoop and NoSQL. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This information is geared towards:
- Big Data & Data Warehouse Architects,
- ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
- Improve your Data Quality
- Accelerate your data testing cycles
- Reduce your costs & risks
- Provide a huge ROI (as high as 1,300%)
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau (MongoDB)
Pairing your real-time operational data stored in a modern database like MongoDB with first-class business intelligence platforms like Tableau enables new insights to be discovered faster than ever before.
Many leading organizations already use MongoDB in conjunction with Tableau including a top American investment bank and the world’s largest airline. With the Connector for BI 2.0, it’s never been easier to streamline the connection process between these two systems.
In this webinar, we will create a live connection from Tableau Desktop to a MongoDB cluster using the Connector for BI. Once we have Tableau Desktop and MongoDB connected, we will demonstrate the visual power of Tableau to explore the agile data storage of MongoDB.
You’ll walk away knowing:
- How to configure MongoDB with Tableau using the updated connector
- Best practices for working with documents in a BI environment
- How leading companies are using big data visualization strategies to transform their businesses
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
1. Executive Summary: Google BigQuery
• Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets.
• Developers can send up to 100,000 rows of real-time data per second to BigQuery and analyze it in near real time.
• BigQuery is Google's fully managed, NoOps data analytics service.
• BigQuery bills on a per-project basis, so it's usually easiest to create a single project for your company that's maintained by your billing department.
• Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method.
• There are also a variety of third-party tools that you can use to interact with BigQuery, for example to visualize or load data.
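To illustrate the streaming path, here is a hedged sketch of how records might be batched into tabledata().insertAll() request bodies. The 500-rows-per-request cap follows BigQuery's documented recommendation for streaming inserts, and the record fields are invented for illustration.

```python
# Sketch of batching records for the tabledata().insertAll() REST method.
# 500 rows per request follows BigQuery's documented recommendation;
# the record fields below are made up.
import uuid

def insert_all_payloads(records, max_rows=500):
    """Yield insertAll request bodies, each carrying at most max_rows rows.

    Each row gets an insertId so BigQuery can de-duplicate retried requests."""
    for start in range(0, len(records), max_rows):
        chunk = records[start:start + max_rows]
        yield {
            "kind": "bigquery#tableDataInsertAllRequest",
            "rows": [{"insertId": str(uuid.uuid4()), "json": rec} for rec in chunk],
        }

records = [{"user": f"u{i}", "clicks": i} for i in range(1200)]
payloads = list(insert_all_payloads(records))
print(len(payloads), [len(p["rows"]) for p in payloads])  # 3 [500, 500, 200]
```

A real client would POST each payload to the table's insertAll endpoint; the per-row insertId is what lets BigQuery treat a retried batch as best-effort idempotent.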
3. Why should I use Google BigQuery?
• Collect, ingest, and analyze all the large amounts of data your organization/application/service generates.
• Process your big data in a scalable, cost-effective, fast manner to further product goals.
How will it really benefit me?
• BigQuery is Google's fully managed, NoOps data analytics service.
• No infrastructure or database admin costs, in a pay-as-you-go model.
• A myriad of features that can help your company at any stage (startup to Fortune 500).
4. Google BigQuery and its fit in the analytics landscape
• MapReduce-based analytics can be slow for ad-hoc queries.
• Managing data centers and tuning software takes time and money.
• Analytics tools should be services.
5. What makes Google BigQuery special?
Flexible Data Ingestion
Load your data from Google Cloud Storage or Google Cloud Datastore, or stream it into BigQuery at 100,000 rows per second to enable real-time analysis of your data.
Fast & Performant
BigQuery's columnar architecture is designed to handle nested and repeated fields in a highly performant manner, enabling super-fast queries that help you save time and money.
Affordable Big Data
Loading and exporting data, and metadata operations, are free of charge. Pay only for what you store and what you query, and the first 1 TB of data processed each month is free.
Ease of Collaboration
BigQuery enables you to access, save, and share complex datasets. You can also specify what permissions collaborators have on a dataset.
Protected
BigQuery is built with a replicated storage strategy. All data is encrypted both in flight and at rest. You can protect your data with strong role-based ACLs that you configure and control.
Strong Partner Ecosystem
Partners have integrated BigQuery with some of the industry-leading tools for loading, transforming, and visualizing data.
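As an illustration of the role-based ACL point, a dataset's access list in the BigQuery REST API looks roughly like the following sketch; the project, dataset, and email addresses are hypothetical.

```json
{
  "datasetReference": {"projectId": "my-project", "datasetId": "analytics"},
  "access": [
    {"role": "OWNER",  "groupByEmail": "data-admins@example.com"},
    {"role": "WRITER", "userByEmail":  "etl-service@example.com"},
    {"role": "READER", "specialGroup": "projectReaders"}
  ]
}
```

Note that the roles attach to the dataset as a whole, which matches the later point that sharing happens at the dataset level rather than per table.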
6. The pay-as-you-go pricing model (on-demand pricing)
Resource             Pricing
Loading Data         Free
Exporting Data       Free
Storage              $0.020 per GB / month
Interactive Queries  $5 per TB processed
Batch Queries        $5 per TB processed
Streaming Inserts    $0.01 per 200 MB (each row counted as a minimum of 1 KB)
BigQuery uses a columnar data structure, which means that for a given query you are charged only for the data processed in the referenced columns, not the entire table. The first 1 TB of data processed each month is free of charge.
7. More on BigQuery
• Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure.
• Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure.
• BigQuery can be accessed from the Web UI, command-line tools, or the BigQuery REST API.
9. Uses & Customer Case Studies of BigQuery
• Log Analysis - Making sense of computer generated records
• Retailer - Using data to forecast product sales
• Ads Targeting - Targeting proper customer sections
• Sensor Data - Collect and visualize ambient data
• Data Mashup - Query terabytes of heterogeneous data
Customer examples (logos omitted):
• Uses BigQuery to hone ad targeting and gain insights into their business
• Builds dashboards with BigQuery to analyze booking and inventory data
• Uses BigQuery to give their customers ways to expand game engagement and find new channels for monetization
11. BigQuery Fundamentals
• Projects are top-level containers in Google Cloud Platform.
They store information about billing and authorized users, and
They contain BigQuery data.
Each project has a friendly name and a unique ID.
• BigQuery bills on a per-project basis, so it’s usually easiest to create a single project for your
company that’s maintained by your billing department.
12. BigQuery Fundamentals
• Tables contain your data in BigQuery, along with a corresponding table schema that describes
field names, types, and other information.
• BigQuery also supports views, virtual tables defined by a SQL query.
• BigQuery creates tables in one of the following ways:
Loading data into a new table
Running a query
Copying a table
13. BigQuery Fundamentals
• Datasets allow you to organize and control access to your tables. Because tables are contained
in datasets, you'll need to create at least one dataset before loading data into BigQuery.
• You share BigQuery data with others by setting ACLs on datasets, not on the tables within them.
• Jobs are actions you construct and BigQuery executes on your behalf to load data, export data,
query data, or copy data.
• Since jobs can potentially take a long time to complete, they execute asynchronously and can be
polled for their status.
• BigQuery saves a history of all jobs associated with a project, accessible via the Google Developers
Console.
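Because jobs execute asynchronously, client code typically submits a job and then polls its status until it reaches the DONE state. A minimal polling loop might look like this; fake_get_status stands in for a real jobs.get() call and is not part of any API:

```python
import time

def wait_for_job(get_status, poll_interval=0.01):
    """Poll a job until its state is 'DONE', then surface any error."""
    while True:
        job = get_status()
        if job["status"]["state"] == "DONE":
            if "errorResult" in job["status"]:
                raise RuntimeError(job["status"]["errorResult"]["message"])
            return job
        time.sleep(poll_interval)

# Stand-in for jobs.get(): reports PENDING, then RUNNING, then DONE.
states = iter(["PENDING", "RUNNING", "DONE"])
fake_get_status = lambda: {"status": {"state": next(states)}}
print(wait_for_job(fake_get_status)["status"]["state"])  # DONE
```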
14. Interacting with BigQuery
There are three main ways to interact with BigQuery.
1. Loading and exporting data
Before you can query any data, you'll need to load it into BigQuery.
If you want to get the data back out of BigQuery, you can export the data.
2. Querying and viewing data
Once you load your data into BigQuery, there are a few ways to query or view the data in your tables:
• Querying data
Calling the bigquery.jobs.query() method
Calling the bigquery.jobs.insert() method with a query configuration
• Viewing data
Calling the bigquery.tabledata.list() method
Calling the bigquery.jobs.getQueryResults() method
15. Interacting with BigQuery
3. Managing data
In addition to querying and viewing data, you can manage data in BigQuery by using functions that enable the
following tasks:
• Listing projects, jobs, tables and datasets
• Getting information about jobs, tables and datasets
• Updating or patching tables and datasets
• Deleting tables and datasets
16. Loading Data Into BigQuery
• Before you can query your data, you first need to load it into BigQuery. You can bulk load the data by using a job, or
stream records individually.
• Load jobs support three data sources:
1. Objects in Google Cloud Storage
2. Data sent with the job or streaming insert
3. A Google Cloud Datastore backup
• Loaded data can be added to a new table, appended to a table, or can overwrite a table. Data can be represented as
a flat or nested/repeated schema, as described in Data formats. Each individual load job can load data from multiple
sources, configured with the sourceUris property.
• It can be helpful to prepare the data before loading it into BigQuery, or transform the data if needed.
17. Loading Data into BigQuery
• ACCESS CONTROL –
• Loading data into BigQuery requires the following access levels:
• BigQuery: WRITE access for the dataset that contains the destination table.
• Google Cloud Storage: READ access for the object, if loading data from Google Cloud Storage.
• Google Cloud Datastore: READ access to the backup objects in Google Cloud Storage, if loading data from a Cloud Datastore backup.
• QUOTA POLICY –
• Daily limit : 1,000 load jobs per table per day (including failures), 10,000 load jobs per project per day (including failures)
• Maximum size per load: 5 TB across all input files for CSV and JSON.
• Maximum number of files per load: 100,000
• DATA AVAILABILITY –
• Warm-up period: the first time data is streamed, it is inaccessible for the first 2 minutes. After several hours of inactivity,
the warm-up period occurs again before newly streamed data becomes queryable.
• Data can take up to 90 minutes to become available for copy and export operations.
• DATA CONSISTENCY –
• Once you've called jobs.insert() to start a job, you can poll the job for its status by calling jobs.get().
• We recommend generating a job ID and passing it as jobReference.jobId when calling jobs.insert(). This approach is more robust to network
failure because the client can poll or retry on the known job ID.
• Note that calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and
at most one of those operations will succeed.
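The recommended pattern above can be sketched as follows; generate_job_id and the flaky_insert stub are illustrative, not the real client library:

```python
import uuid

def generate_job_id(prefix="load"):
    # Client-generated ID: retrying jobs.insert() with the same ID is
    # idempotent, so at most one of the attempts actually starts a job.
    return f"{prefix}_{uuid.uuid4().hex}"

def insert_with_retry(insert, job_id, attempts=3):
    """Retry a possibly flaky insert call, reusing one fixed job ID."""
    for attempt in range(attempts):
        try:
            return insert(job_id)
        except ConnectionError:
            if attempt == attempts - 1:
                raise

calls = []
def flaky_insert(job_id):  # stub: fails once, then succeeds
    calls.append(job_id)
    if len(calls) == 1:
        raise ConnectionError("network blip")
    return {"jobReference": {"jobId": job_id}}

job_id = generate_job_id()
insert_with_retry(flaky_insert, job_id)
assert calls == [job_id, job_id]  # same ID on both attempts
```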
18. Loading Data into BigQuery
Third-party tools integrate with BigQuery for ETL (loading data) and for visualization and business intelligence (vendor logos omitted).
19. Loading Data into BigQuery
Loading Data Using the Web Browser
• Upload from local disk or from Cloud Storage
• Start the Web browser
• Select Dataset
• Create table and follow the wizard steps
20. Loading Data into BigQuery
Loading Data Using the BQ Tool
• If not specified, the default file format is CSV
(comma-separated values)
• The files can also use newline-delimited JSON
format
• Schema: either a filename or a comma-separated
list of column_name:datatype pairs that describe
the table schema
• Data source may be on the local machine or on
Cloud Storage
The "bq load" command syntax:
bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uri table_schema
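The inline table_schema argument follows a simple column_name:datatype grammar. A sketch of parsing such a string; BQ_TYPES is the datatype list from these slides, and the STRING default for omitted types mirrors bq's documented behavior (treat both as assumptions):

```python
BQ_TYPES = {"STRING", "INTEGER", "FLOAT", "BOOLEAN", "RECORD", "TIMESTAMP"}

def parse_schema(schema):
    """Parse 'name:type,name:type' into (name, TYPE) pairs."""
    fields = []
    for pair in schema.split(","):
        name, _, dtype = pair.partition(":")
        dtype = dtype.upper() or "STRING"  # bq assumes STRING if omitted
        if dtype not in BQ_TYPES:
            raise ValueError(f"unknown type: {dtype}")
        fields.append((name, dtype))
    return fields

print(parse_schema("name:string,age:integer,signup:timestamp"))
# [('name', 'STRING'), ('age', 'INTEGER'), ('signup', 'TIMESTAMP')]
```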
21. Preparing Data For BigQuery
Depending on your data's structure, you might need to prepare the data before loading it into BigQuery. Let's look at some datatypes
and formats BigQuery expects –
• DATA FORMATS – CSV, JSON
Choose your format based on the following factors:
1. Nested/repeated data: use JSON; flat data: use CSV
2. Newlines present in string data: JSON can be loaded much faster
• DATA FORMAT LIMITS –
• Row and cell size limits:
Data format   Max limit
CSV           2 MB (row and cell size)
JSON          2 MB (row size)
• File size limits:
File type   Compressed   Uncompressed
CSV         1 GB         4 GB (with newlines in strings); 1 TB (without newlines in strings)
JSON        1 GB         1 TB
22. Preparing Data For BigQuery
• DATATYPES – Your Data can include the following datatypes
• DATA ENCODING - BigQuery supports UTF-8 encoding for both nested/repeated and flat data, and supports ISO-8859-1 encoding
for flat data.
• DATA COMPRESSION - BigQuery can load uncompressed files significantly faster than compressed files due to parallel load
operations, but because uncompressed files are larger in size, using them can lead to bandwidth limitations and higher Google
Cloud Storage costs. In general, if bandwidth is limited, gzip compress files before uploading them to Google Cloud Storage. If
loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.
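If bandwidth is the constraint, compressing before upload can be sketched with Python's standard library; gzip_file is an illustrative helper, not part of any BigQuery tooling:

```python
import gzip
import shutil

def gzip_file(src, dst):
    """Compress src to dst (e.g. data.csv -> data.csv.gz) before
    uploading it to Google Cloud Storage."""
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```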
Data type   Possible values
STRING      64 KB UTF-8 encoded string
INTEGER     64-bit signed integer
FLOAT       Double-precision floating-point format
BOOLEAN     CSV format: true or false (case insensitive), or 1 or 0; JSON format: true or false (case insensitive)
RECORD      A collection of one or more other fields
TIMESTAMP   A UNIX timestamp or calendar datetime; BigQuery stores TIMESTAMP data internally as a UNIX timestamp with microsecond precision
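Since TIMESTAMP values are stored internally as UNIX timestamps with microsecond precision, converting calendar datetimes for loading can be sketched as:

```python
from datetime import datetime, timezone

def to_bq_micros(dt):
    """Convert an aware datetime to microseconds since the UNIX epoch."""
    return int(dt.timestamp() * 1_000_000)

def from_bq_micros(micros):
    """Recover a UTC datetime from a microsecond UNIX timestamp."""
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

dt = datetime(2016, 3, 1, 12, 30, 0, 250000, tzinfo=timezone.utc)
micros = to_bq_micros(dt)
print(micros)                        # 1456835400250000
print(from_bq_micros(micros) == dt)  # True
```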
23. Preparing Data For BigQuery
• DENORMALIZING YOUR DATA –
• Normalization eliminates duplicate data and provides the important benefit of consistency
when the data is regularly updated.
• In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying.
Some degree of normalization is still possible with the nested/repeated functionality.
Let's take a simple example -- recording the cities that a list of people lived in during their lives -- and compare the relational database, flat schema, and nested/repeated schema representations (diagrams omitted).
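For the people-and-cities example, a denormalized nested/repeated row in newline-delimited JSON might be built like this (the field names are illustrative):

```python
import json

# One row per person; the repeated "cities" field replaces the
# person-to-city join table of the relational design.
row = {
    "name": "Ada",
    "cities": [
        {"city": "London", "years": 10},
        {"city": "Paris", "years": 3},
    ],
}
# Newline-delimited JSON: one compact object per line of the load file.
print(json.dumps(row, separators=(",", ":")))
```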
24. Streaming Data into BigQuery
Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a
time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a
load job.
There are several important trade-offs to consider before choosing an approach.
• ACCESS CONTROL –
• Streaming data into BigQuery requires WRITE access for the dataset that contains the
destination table.
• QUOTA POLICY –
• Maximum row size: 1 MB
• HTTP request size limit: 10 MB
• Maximum rows per second: 100,000 rows per second, per table. Exceeding this amount will cause quota_exceeded errors.
• Maximum rows per request: 500
• Maximum bytes per second: 100 MB per second, per table. Exceeding this amount will cause quota_exceeded errors.
• DATA AVAILABILITY –
• Warm-up period: the first time data is streamed, it is inaccessible for the first 2 minutes. After several
hours of inactivity, the warm-up period occurs again before newly streamed data becomes queryable.
• Data can take up to 90 minutes to become available for copy and export operations.
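To stay inside the 500-rows-per-request quota above, a client would batch its rows before each insertAll call; a sketch:

```python
def batch_rows(rows, max_rows=500):
    """Yield chunks of at most max_rows rows (the per-request quota)."""
    for i in range(0, len(rows), max_rows):
        yield rows[i:i + max_rows]

rows = [{"n": i} for i in range(1200)]
print([len(batch) for batch in batch_rows(rows)])  # [500, 500, 200]
```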
25. Streaming Data into BigQuery
• DATA CONSISTENCY –
• To help ensure data consistency, you can supply an insertId for each inserted row.
• BigQuery remembers this ID for at least one minute.
• If you try to stream the same set of rows within that time period and the insertId property is set,
BigQuery uses the insertId property to de-duplicate your data on a best-effort basis.
• Leverage the de-duplication process when retrying inserts, since there's no way to determine the
state of a streaming insert under certain error conditions.
• For example, network errors between your system and BigQuery or internal errors within
BigQuery. In rare instances of regional data center unavailability, data duplication might occur
for the data hosted in the region experiencing the disruption. New row insertions would be
routed to data centers in another region, but de-duplication with the unavailable data would
not be possible.
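The best-effort de-duplication can be illustrated with an in-memory sketch of the server-side behavior; the one-minute window mirrors the "at least one minute" guarantee above, and the class itself is purely illustrative:

```python
import time

class DedupBuffer:
    """Best-effort de-dup: drop rows whose insertId was seen recently."""
    def __init__(self, window=60.0):
        self.window = window
        self.seen = {}  # insertId -> last-seen time

    def insert(self, rows):
        now = time.monotonic()
        # Forget IDs older than the window.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t < self.window}
        accepted = []
        for insert_id, row in rows:
            if insert_id not in self.seen:
                self.seen[insert_id] = now
                accepted.append(row)
        return accepted

buf = DedupBuffer()
first = buf.insert([("id-1", {"x": 1}), ("id-2", {"x": 2})])
retry = buf.insert([("id-1", {"x": 1})])  # duplicate dropped
print(len(first), len(retry))  # 2 0
```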
26. Streaming Data into BigQuery - Examples
1. HIGH VOLUME EVENT LOGGING -
If you have an app that collects a large amount of data in real-time, streaming inserts can be a good choice.
Generally, these types of apps have the following criteria:
Not transactional. High volume, continuously appended rows. The app can tolerate a rare possibility that
duplication might occur or that data might be temporarily unavailable.
Aggregate analysis. Queries generally are performed for trend analysis, as opposed to single or narrow
record selection.
One example of high volume event logging is event tracking. Suppose you have a mobile app that tracks events. Your
app, or mobile servers, could independently record user interactions or system errors and stream them into BigQuery.
You could analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor
error conditions in real-time.
27. Accessing BigQuery
• BigQuery Web browser
• Imports/exports data, runs
queries
• BQ command line tool
• Performs operations from the
command line
• Service API
• RESTful API to access BigQuery
programmatically
• Requires authorization by OAuth2
• Google client libraries for Python,
Java, JavaScript, PHP, ...
28. BigQuery Best Practices
• Split CSV/JSON into chunks smaller than 1 TB
Use the "split" command with the --line-bytes option
Smaller files make error recovery easier
Split to smaller data units (day or month instead of year)
• Uploading to Cloud Storage is recommended
• Split tables by date
Minimizes the cost of data scanned
Minimizes query time
• Upload multiple files to Cloud Storage
Allows parallel loading into BigQuery
• Denormalize your data
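The split-into-chunks advice can be sketched in Python, mirroring what `split --line-bytes` does: start a new chunk before a line would push the current one over the limit, never splitting a line:

```python
def split_lines(lines, max_bytes):
    """Group lines into chunks whose total size stays under max_bytes,
    never splitting a single line across chunks."""
    chunks, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(current)
    return chunks

rows = ["a" * 10 + "\n"] * 5  # five 11-byte lines
print([len(c) for c in split_lines(rows, 25)])  # [2, 2, 1]
```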
29. References for the Presentation
• https://cloud.google.com/bigquery/
• https://cloud.google.com/bigquery/what-is-bigquery
• https://cloud.google.com/bigquery/docs/reference/v2/
• https://en.wikipedia.org/wiki/BigQuery