Introduce self and Outsite.
What is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Same as “web-scale” data.
The challenges include: capture, curation, storage, search, sharing, transfer, analysis, and visualization.
OLAP (Online Analytical Processing) is not a good option because of the volume of data.
OLTP (Online Transaction Processing) is not designed for that type of reporting.
The Hadoop ecosystem is made up of a lot of companies and projects.
Hadoop also has its origins in Google research, which I will talk about shortly.
There are also visualization tools such as Tableau (out of scope for this talk).
Google BigQuery!
BigQuery is a RESTful web service that enables interactive analysis of massively large datasets, working in conjunction with Google Cloud Storage. It is an Infrastructure as a Service (IaaS) offering that may be used complementarily with MapReduce.
Apache Hadoop's MapReduce and HDFS components originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.
The Apache Hadoop framework is composed of the following modules:
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
• Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications
• Hadoop MapReduce – a programming model for large-scale data processing
Beyond HDFS, YARN, and MapReduce, the entire Apache Hadoop “platform” is now commonly considered to include a number of related projects as well – Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.
Big Data requires massive amounts of storage across multiple drives, and a file system that overcomes hardware bottlenecks when processing large data sets.
Multiple CPUs are required to map/reduce the data (this includes management of the individual jobs).
Running jobs can take time, so both the time to map/reduce and the time to compose a query matter.
If you don’t, a kitten dies every minute.
No need to install any of the server software; everything is hosted.
A lot of data science and engineering effort went into creating BigQuery.
Google uses it internally.
Google’s initial technologies were GFS and MapReduce (Google released research papers on both):
• The Google File System (GFS), 2003, by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
• MapReduce: Simplified Data Processing on Large Clusters, 2004, by Jeffrey Dean and Sanjay Ghemawat
GFS is a proprietary distributed file system.
The main goals of a distributed file system are:
1. Speed
2. Scalability
3. Reliability
Google File System grew out of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days of Google, while it was still located at Stanford.
It is designed to provide efficient, reliable access to data using large clusters of commodity hardware. A new version of the Google File System is codenamed Colossus.
Commodity computing, or commodity cluster computing, is the use of large numbers of readily available computing components for parallel computing, to get the greatest amount of useful computation at low cost. It is computing done on commodity computers, as opposed to high-cost superminicomputers or boutique computers. They are easy to populate data centers with.
Some of the general characteristics of a commodity computer are:
• Shares a base instruction set common to many different models.
• Shares an architecture (memory, I/O map, and expansion capability) that is common to many different models.
• High degree of mechanical compatibility; internal components (CPU, RAM, motherboard, peripheral cards, drives) are interchangeable with other models.
• Software is widely available off the shelf.
• Compatible with most available peripherals; works with most right out of the box.
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine once.
MapReduce is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation.
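A minimal in-process sketch of the model (the real framework distributes these phases across a cluster; here Map emits (key, value) pairs, a shuffle groups them by key, and Reduce summarizes each group):

```python
from collections import defaultdict

def map_phase(document):
    # Map(): emit a (word, 1) pair per word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The "MapReduce System" role: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce(): summary operation, here a count per word.
    return (key, sum(values))

docs = ["the quick fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts is {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```

The same three-phase shape scales out because each Map and Reduce call is independent, which is what the framework parallelizes.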
Google released the Dremel white paper in September 2010.
Dremel is a brand of power tools that primarily rely on their speed as opposed to torque. Google's goal for BigQuery was to query 1 TB of data in less than 1 s.
Dremel has been in production since 2006 and has thousands of users within Google. It replaced MapReduce in many instances, but the two can be complementary. Multiple instances of Dremel are deployed in the company, ranging from tens to thousands of nodes. Examples of using the system include:
• Analysis of crawled web documents.
• Tracking install data for applications on Android Market.
• Crash reporting for Google products.
• OCR results from Google Books.
• Spam analysis.
• Debugging of map tiles on Google Maps.
• Tablet migrations in managed Bigtable instances.
• Results of tests run on Google’s distributed build system.
• Disk I/O statistics for hundreds of thousands of disks.
• Resource monitoring for jobs run in Google’s data centers.
• Symbols and dependencies in Google’s codebase.
Dremel builds on ideas from web search and parallel DBMSs. In contrast to layers such as Pig and Hive for Hadoop, it executes queries natively without translating them into MapReduce jobs.
** The data is read-only/append-only **
Dremel uses a column-striped storage representation, which enables it to read less data from secondary storage and reduces CPU cost thanks to cheaper compression. Column stores have been adopted for analyzing relational data, but to the best of our knowledge had not previously been extended to nested data models.
One of the ingredients for building interoperable data management components is a shared storage format. Columnar storage proved successful for flat relational data, but making it work for Google required adapting it to a nested data model. Figure 1 illustrates the main idea: all values of a nested field such as A.B.C are stored contiguously, so A.B.C can be retrieved without reading A.E, A.B.D, etc. The challenge this addresses is how to preserve all structural information and still be able to reconstruct records from an arbitrary subset of fields.
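A toy illustration of the column-striped idea, using flat records only (Dremel's real format also encodes repetition and definition levels so that nested records can be reconstructed; this sketch omits that):

```python
# Row-oriented records, as a query engine receives them.
rows = [
    {"url": "a.com", "status": 200, "bytes": 512},
    {"url": "b.com", "status": 404, "bytes": 64},
    {"url": "c.com", "status": 200, "bytes": 2048},
]

# Striped into columns: each field's values are stored contiguously,
# so a query touching only 'status' never reads 'url' or 'bytes'.
columns = {field: [r[field] for r in rows] for field in rows[0]}

def project(columns, fields):
    # Reconstruct records from an arbitrary subset of columns.
    return [dict(zip(fields, vals))
            for vals in zip(*(columns[f] for f in fields))]
```

Reading only the columns a query names is also why column-wise compression is cheap: each stripe holds values of a single type.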
Web-based management interface
Flat files (CSV/JSON)
Libraries in most of the major programming languages
A RESTful API
SQL syntax for querying
BigQuery queries are written using a variation of the standard SQL SELECT statement. BigQuery supports a wide variety of functions such as COUNT, arithmetic expressions, and string functions.
https://developers.google.com/bigquery/query-reference
Query syntax:
SELECT
WITHIN
FROM
FLATTEN
JOIN
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
** Retrieving large result sets can be time consuming – use LIMIT and/or aggregates! **
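As a sketch of how several of these clauses fit together (the bracket quoting is BigQuery's legacy style, and the sample dataset/table name here is an assumption, not taken from the slides):

```python
# A hypothetical BigQuery query over a public sample table, following the
# clause order above and ending with LIMIT to keep the result set small.
query = """
SELECT repository.language, COUNT(*) AS repos
FROM [publicdata:samples.github_nested]
WHERE repository.language IS NOT NULL
GROUP BY repository.language
ORDER BY repos DESC
LIMIT 10
"""
```

The query string would be submitted via the browser tool, bq, or the API; the aggregate plus LIMIT is what keeps the returned result small even though all rows are scanned.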
Dremel has most of the standard SQL aggregate functions, such as COUNT, SUM, MIN, MAX, and AVG.
Dremel also has functions for extracting JSON from a field using a JSONPath syntax.
Dremel has URL and IP functions, which can make quick work of any network/web logs.
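To show what JSONPath-style extraction does, here is a rough stdlib-only analog of pulling a dotted path out of a JSON string (the real BigQuery function supports far more of JSONPath than this; the function name here is just illustrative):

```python
import json

def json_extract(json_text, path):
    # path is a simplified JSONPath like "$.user.name"
    # (no array indexing or filters in this sketch).
    value = json.loads(json_text)
    for key in path.lstrip("$.").split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # missing path behaves like SQL NULL
        value = value[key]
    return value
```

In BigQuery the equivalent happens server-side against a string column, so raw JSON logs can be queried without reshaping them first.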
BigQuery supports multiple JOIN operations in each SELECT statement.
JOIN types:
BigQuery supports INNER, LEFT OUTER, and CROSS JOIN operations. The default is INNER.
CROSS JOIN clauses must not contain an ON clause. CROSS JOIN operations can return a large amount of data and might result in a slow and inefficient query. When possible, use a regular JOIN instead.
EACH modifier:
Normal JOIN operations require that the right-side table contains less than 8 MB of compressed data. The EACH modifier is a hint that informs the query execution engine that the JOIN might reference two large tables. The EACH modifier can't be used in CROSS JOIN clauses.
When possible, use JOIN without the EACH modifier for best performance. Use JOIN EACH when table sizes are too large for JOIN.
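The three JOIN types behave as in standard SQL; a sketch over lists of dicts makes the semantics concrete (the tables and key names are made up for illustration):

```python
def inner_join(left, right, key):
    # Only rows with a match on both sides survive (the default JOIN).
    return [dict(l, **r) for l in left for r in right if l[key] == r[key]]

def left_outer_join(left, right, key):
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend(dict(l, **r) for r in matches)
        else:
            out.append(dict(l))  # unmatched left row kept, right fields NULL-like
    return out

def cross_join(left, right):
    # No ON clause: every pairing, hence potentially huge results.
    return [dict(l, **r) for l in left for r in right]

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "amt": 5}]
```

This also shows why CROSS JOIN is flagged as dangerous: its output size is the product of both inputs.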
The building blocks of BigQuery are:
Projects
Tables
Datasets
Jobs
Projects are top-level containers in Google's Cloud Platform. They store information about billing and authorized users, and they contain BigQuery data. Each project has a friendly name and a unique ID.
BigQuery bills on a per-project basis, so it’s usually easiest to create a single project for your company that’s maintained by your billing department. For more information on how to grant access to your project, see Access Control.
Tables contain your data in BigQuery, along with a corresponding table schema that describes field names, types, and other information. BigQuery also supports views, virtual tables defined by a SQL query.
BigQuery creates tables in one of the following ways:
Loading data into a new table
Running a query
Copying a table
Jobs are actions you construct and BigQuery executes on your behalf to load data, export data, query data, or copy data. Since jobs can potentially take a long time to complete, they execute asynchronously and can be polled for their status.
BigQuery saves a history of all jobs associated with a project, accessible via the Google Developers Console.
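Because jobs are asynchronous, clients poll for a terminal state; a generic polling loop might look like this (the `get_status` callable is a stand-in for whatever fetches the job's state string, e.g. via the REST API, and the state names mirror BigQuery's PENDING/RUNNING/DONE):

```python
import time

def wait_for_job(get_status, max_polls=100, poll_interval=1.0, sleep=time.sleep):
    # get_status is a hypothetical callable returning the job's state
    # string ('PENDING', 'RUNNING', or 'DONE'); sleep is injectable
    # so the loop can be tested without real delays.
    for _ in range(max_polls):
        state = get_status()
        if state == "DONE":
            return state
        sleep(poll_interval)
    raise TimeoutError("job did not reach DONE within max_polls")
```

A bounded poll count (rather than `while True`) keeps a stuck job from hanging the client forever.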
BigQuery can be accessed/used three ways:
Browser tool (limited in functionality – can’t update tables)
Command-line tools
API
BigQuery supports two data formats for import/export (and streaming):
CSV
JSON (newline-delimited)
Data can be compressed via gzip.
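Newline-delimited JSON just means one complete JSON object per line, which is what distinguishes it from a single JSON array; producing it takes a few lines (the sample records are made up):

```python
import json

records = [
    {"name": "alice", "visits": 3},
    {"name": "bob", "visits": 5},
]

# One JSON object per line -- the JSON variant BigQuery ingests.
ndjson = "\n".join(json.dumps(r) for r in records)
```

The per-line format matters because it lets a loader split a large file and parse lines independently, rather than reading one giant array into memory.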
BigQuery has excellent command-line tools written in Python: gcloud, bq, and gsutil.
gcloud allows updating and using all of the Google Cloud services from the command line.
bq is a Python-based tool that accesses BigQuery from the command line.
gsutil is another tool, which can upload/download files to and from Google Cloud Storage.
These tools give you the option to script via PowerShell or other means if you do not want to use the API.
Rented massive parallelism is much more cost-effective than trying to set up the infrastructure to do it yourself. BigQuery is comparable to Amazon Elastic MapReduce (EMR) and Cloudera's Hadoop pricing.
With Amazon EMR you can launch a 10-node Hadoop cluster for as little as $0.15 per hour. BigQuery, however, does not price with a node structure.
Doing Big Data correctly requires large clusters of commodity hardware.
Maintaining a data center while trying to implement something like Hadoop can be very challenging for even the most veteran neck-beards.
Cloud computing provides all of the redundancy, scalability, and other ‘ilities’.
BigQuery has two pricing plans:
On-Demand
Reserved-Capacity
Pay-as-you-go model.
Resource pricing:
Loading data – free
Exporting data – free
Table reads – free
Storage – $0.026 per GB/month
Streaming inserts – free until July 1, 2014 (after July 1, 2014: $0.01 per 100,000 rows)
How am I charged for queries?
BigQuery uses a columnar data structure, which means that for a given query you are only charged for the data processed in each column, not the entire table. For instance, if a table has 26 columns and you run the following query:
SELECT a, b, f FROM table1 WHERE d > 100 ORDER BY e
you would be charged for processing data in columns a, b, f, d, and e only. For more information on column-oriented database structures, see Column-oriented DBMS.
BigQuery accesses all rows of a table when you run a query on the table, and charges according to the total data processed in the columns you select. ** For this reason, if you expect your queries to generally focus on data from a particular time frame, it can be economical, and sometimes better performing, to shard your data into separate tables based on a timestamp. **
If you receive a query error, you aren't charged for that query.
Query pricing: interactive queries – $0.005 per GB processed; batch queries – $0.005 per GB processed.
1. Charges are rounded up to the nearest MB; minimum 10 MB of data processed per table referenced by a query.
2. The first 100 GB of data processed per month is at no charge.
3. Charges are based on the uncompressed data size.
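The per-query arithmetic above can be sketched directly from the listed figures ($0.005/GB, round up to the nearest MB, 10 MB minimum per referenced table; a single-table query is assumed, and the free monthly tier is ignored here):

```python
def query_cost_usd(bytes_processed, price_per_gb=0.005):
    # Sum of uncompressed bytes in the referenced columns, rounded up
    # to the nearest MB, with a 10 MB floor for the one table referenced.
    mb = 1 << 20
    gb = 1 << 30
    billed_mb = max(10, -(-bytes_processed // mb))  # ceiling division
    return billed_mb * mb / gb * price_per_gb
```

So scanning 1 GB of column data costs half a cent, and even a tiny query is billed as if 10 MB were processed.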
For customers with consistent or larger workloads, reserved capacity can save as much as 70% off on-demand pricing.
To sign up for reserved capacity, contact a sales representative.