As the internet grew and evolved from simple text data into multimedia, relational database engines failed to adapt and support these new data types. This especially became important with mobile and our ability to share photos and videos as they were happening. We had to figure out a way to store all those cat videos and pictures of food somehow!
New content producers accelerated this growth by empowering millions of us with the ability to share photos, videos, and more through sites like Facebok, Twitter, Youtube, and Instagram.
Prior to this internet evolution, most data were transactional in nature adhering to well structured rows and columns. Since this revolution the internet has exploded in growth both in terms of it’s breadth and it’s depth. Transactional business data which supports most key types of analysis has not increased nearly at the rate as social media data, web logs, and sensor data. So when we talk about “Big Data” it’s important to keep in mind the meat of analysis at the heart of most business processes is still manageable using traditional systems. Src: IDC Digital Universe 2009: White Paper, Sponsored by EMC, 2009
For years data capture mechanisms have greatly surpassed the processing capacity of the hardware on which they ran. New technologies taking advantage of standard hardware, also known as commodity hardware, have closed the gap so it is no longer cost-prohibitive to only store part of the data versus everything. Along with these new technologies running on commodity hardware are high-powered analytical engines taking full advantage of huge improvements in throughput and disk operations found in new solid state disks.Src: http://www.mkomo.com/cost-per-gigabyte
Document stores are as the name implies, a repository of documents and the associated framework for storing an managing them. These documents are often structured in a way that applications can easily retrieve and parse their data for use. This makes querying these sources for analytics exceptionally difficult, however there are software packages aimed at providing a SQL like interface for their documents.
All of this may seem a bit overwhelming, but rest assured, there are several platform vedors to your rescue, all for a price of course. The biggest platform vendor out there is Cloudera. Cloudera offers a complete hadoop system bundled into their CDH product. They also offer professional services and many other open-source add-ons to hadoop. One of which is very interesting called Impala, which lets users execute real-time queries against the Hadoop Distributed File System or HDFS.The next platform vendor to mention is Hortonworks. Many of the key people at Hortonworks are those that created the original hadoop project dating back to it’s inception at Yahoo labs. Hortonworks also bundles all of the different technologies you need to run Hadoop, however their more focused on an older build of Hadoop, and from what I’ve seen, haven’t been pushing forward with the newest versions of the underlying tools just yet.Third, and possibly the most interesting, is Infochimps. Infochimps is a startup out of Austin offering a complete on-premise and cloud based Big Data solution. Many new platform vendors like this are sprouting up daily as they all race to capture your IT budget on your next Big Data project. This brings us to the next topic on how to implement Big Data in the cloud.
Optimized for Data Warehousing – Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. It uses columnar storage, data compression, and zone maps to reduce the amount of IO needed to perform queries. Amazon Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources. The underlying hardware is designed for high performance data processing, using local attached storage to maximize throughput between the Intel Xeon E5 processor and drives, and a 10GigE mesh network to maximize throughput between nodes.
Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change. Amazon Redshift enables you to start with as little as a single 2TB XL node and scale up all the way to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Amazon Redshift will place your existing cluster into read-only mode, provision a new cluster of your chosen size, and then copy data from your old cluster to your new one in parallel. You can continue running queries against your old cluster while the new one is being provisioned. Once your data has been copied to your new cluster, Amazon Redshift will automatically redirect queries to your new cluster and remove the old cluster.
No Up-Front Costs – You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing. On-Demand pricing starts at just $0.85 per hour for a single node 2TB data warehouse, scaling linearly with cluster size. With Reserved Instance pricing, you can lower your effective price to $0.228 per hour for a single 2TB node, or under $1,000 per TB per year. To see more details, visit the Amazon Redshift Pricing page.
Get Started in Minutes – With a few clicks in the AWS Management Console or simple API calls, you can create a cluster, specifying its size, underlying node type, and security profile. Amazon Redshift will provision your nodes, configure the connections between them, and secure the cluster. Your data warehouse should be up and running in minutes.Fully Managed – Amazon Redshift handles all the work needed to manage, monitor, and scale your data warehouse, from monitoring cluster health and taking backups to applying patches and upgrades. You can easily add or remove nodes from your cluster as your performance and capacity needs change. By handling all these time-consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business.Fault Tolerant – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically re-replicates data from failed drives and replaces nodes as necessary.Automated Backups – Amazon Redshift’s automated snapshot feature continuously backs up new data on the cluster to Amazon S3. Snapshots, are automated, incremental, and continous. Amazon Redshift stores your snapshots for a user-defined period, which can be from one to thirty-five days. You can also take your own snapshots at any time, which leverage all existing system snapshots and are retained until you explicitly delete them. Once you delete a cluster, your system snapshots are removed but your user snapshots are available until you explicitly delete them.Easy Restores - You can use any system or user snapshot to restore your cluster using the AWS Management Console or the Amazon Redshift APIs. Your cluster is available as soon as the system metadata has been restored and you can start running queries while user data is spooled down in the background.
Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-acccelerated AES-256 encryption for data at rest. If you choose to enable encryption of data at rest, all data written to disk will be encrypted as well as any backups.Isolation - Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster. You can also run Amazon Redshift inside Amazon Virtual Private Cloud (Amazon VPC) to isolate your data warehouse cluster in your own virtual network and connect it to your existing IT infrastructure using industry-standard encrypted IPsec VPN.
SQL - Amazon Redshift is a SQL data warehouse and uses industry standard ODBC and JDBC connections and Postgres drivers. Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you do today. See the Amazon Redshift partner page for details.Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built in commands to load data in parallel to each node from Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. AWS Data Pipeline enables easy, programmatic integration between Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and Amazon Relational Database Service (Amazon RDS).
BigQuery is Google’s Cloud Big Data solution based on the Dremel platform. Dremel has been in development for over 6 years and powers much of Googles Cloud Platform. It’s worth mentioning that for this course I’m going to cover BigQuery at a high-level and then later we’ll connect Tableau up to it to see how functionally to use it. If you’d like to dive deeper into BigQuery Lynn Langit has a course on here which goes into much greater detail that is definitely worth checking out.Let’s start by taking a look at their homepage.
Looking at their interface, on their homepage they proclaim, analyze terabytes of data w/ just a click of a button. Sounds promising, if it weren’t for Amazon Redshift offering petabytes in scale.You’ll also notice a query editor and result pane previewed, this is encouraging however for non-sql developers this can be a scary sight.
Similar to Amazon’s Redshift Google Bigquery stores data in a columnar database format which is great for data compression and query speeds.Google Bigquery differs from Amazon redshift however in that it uses this Tree structure which is similar to a MPP database however it spreads the data extremely wide and for queries creates execution “trees” which can scan tens of thousands of servers or leaf nodes containing the data and return results in miliseconds.Like Redshift, this all adds up to speed! Google is trying to differentiate from MPP solutions with BigQuery by providing what they call full-scan results. This is essentially by creating a query tree of every possible combination of query you can run. In their whitepaper from Kazunori Sato titled “An Inside Look at Google BigQuery” he states that “BigQuery solves the parallel disk I/O problem by utilizing the cloud platform’s economy of scale. You would need to run 10,000 disk drives and 5,000 processors simultaneously to execute the full scan of 1TB of data within one second. “ Impressive.The quote from this whitepaper tat
Dremel is the platform which Google is Based on.
Scalabitlity with Google BigQuery is a bit if a mystery to be honest. Since they handle all of the administration and data distribution for you the scalability really is only limited by based on how much you can afford. Once you upload your data to BigQuery, it handles the rest, you only need to worry about how much data is going to be processed in your queries, this brings us to their pricing model.
Big Data analysis engine without operating a data center Managed service means no additional capital costs Ability to terminate service and remove your data at any timeTransparency in pricing and usage Simplicity: only 2 pricing components (query processing, storage) Flexibility: choice to pay-by-the-month for what you useFull Visibility and Control Monthly billing: Monitor and throttle what you use Tools to optimize usage/costs: best practices, tooling, samplesSince you’re charged by amount of data processed this can be very expensive if using a “chatty” query tool like Tableau. Google recommends to shard data into separate tables using a time stamp and setting your queries to filter just to a specific date range to minimize query costs.In my view this is the only issue with BigQuery. Let’s say you have a query which pulls back something like sales for the west region by month for the past year. This will return 24 data points. That’s 12 integers for sales, and 12 date values corresponding to the month of sales. To get to these 24 data points your query may have to scan millions or billions of rows, imagine Amazon’s detailed sales transactions, aggregate the data, then return your results. Since you’re paying for all the data scanned, a single query could really rack up the bills. Now, if you were building a focused application and not doing visual analytics using a tool like Tableau you can probably handle this quite well however in this case, it can be cost prohibitive to store your data here. I have a friend who was testing this and one of his analysts actually ran a single query that cost them $400!
Big Data Analytics Preview
High Storage Extra Large (XL)
CPU: 2 virtual cores - Intel Xeon E5
Memory: 15 GiB
Storage: 3 HDD with 2TB of local
Disk I/O: Moderate
High Storage Eight Extra Large (8XL)
CPU: 16 virtual cores - Intel Xeon E5
Memory: 120 GiB
Storage: 24 HDD with 16TB of local
Network: 10 Gigabit Ethernet with support
for cluster placement groups
Disk I/O: Very High