Discussion on cloud-based data storage and databases. Presentation given by Zia Babar at the July event of the Waterloo Data Science and Data Engineering meetup.
A short introduction to BigQuery. With this presentation you'll quickly discover:
How to load data into BigQuery
How to build dashboards using BigQuery
How to work with BigQuery
and, last but not least, some best practices
We hope you'll enjoy this presentation and that it will help you start exploring this solution. Don't hesitate to send us your feedback or questions.
This is a presentation by Peter Coppola, VP of Product and Marketing at Basho Technologies and Matthew Aslett, Research Director at 451 Research. Join them as they discuss whether multi-model databases and polyglot persistence have increased operational complexity. They'll discuss the benefits and importance of NoSQL databases and how the Basho Data Platform helps enterprises leverage Big Data applications.
Introduction to our data warehouse solution, BigQuery.
The Google Cloud Platform products are based on our internal systems that power Google AdWords, Search, YouTube, and our leading research in real-time data analysis.
You can get access ($300 for 60 days) to our free trial through google.com/cloud
Introduction to Big Data Technologies & Applications, by Nguyen Cao
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery, by Chris Schalk
This is an introductory presentation given at DevFest Madrid 2010 by Google Developer Advocate Chris Schalk. It introduces new Google cloud technologies: Google Storage, Google Prediction API and BigQuery.
Collaboration is crucial to today’s workforce. Whether you are in a traditional office setting, work from home or travel extensively, there are tools needed to achieve successful content collaboration.
Whether your mission is to improve external collaboration, increase scalability or focus on security and compliance, find out how content collaboration with Box can improve your ROI.
To find out more on how to improve your content journey, visit IBM ECM and Box: http://ibm.co/ibm-box-partnership
Quick Intro to Google Cloud Technologies, by Chris Schalk
This is the "Lightning Presentation" given at DreamForce 2011 on Google's Cloud Technologies. It covers App Engine, Google Storage, and BigQuery. #df11
In this webinar you'll learn about the best practices for Google BigQuery—and how Matillion ETL makes loading your data faster and easier. Find out from our experts how to leverage one of the largest, fastest, and most capable cloud data warehouses to improve your business and save money.
In this webinar:
- Discover how to work fast and efficiently with Google BigQuery
- Find out the best ways to monitor and control costs
- Learn to leverage Matillion ETL and optimize Google BigQuery
- Get tips and tricks for better performance
In this presentation we go through the differences and similarities between Redshift and BigQuery. It was presented during the Athens Big Data meetup in May 2017.
Google BigQuery for Everyday Developer, by Márton Kodok
IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
A. Every scientist who needs big data analytics to save millions of lives should have that power
Legacy systems don’t provide the power.
B. The simple fact is that you are brilliant but your brilliant ideas require complex analytics.
Traditional solutions are not applicable.
The Plan: have oversight over developments as they happen.
Goal: Store everything accessible by SQL immediately.
What is BigQuery?
Analytics-as-a-Service - Data Warehouse in the Cloud
Fully-Managed by Google (US or EU zone)
Scales into Petabytes
Ridiculously fast
Decent pricing (queries $5/TB, storage: $20/TB) *October 2016 pricing
100,000 rows/sec Streaming API
Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
Familiar DB Structure (table, views, record, nested, JSON)
Convenience of SQL + JavaScript UDFs (User-Defined Functions)
Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
Client libraries available in YFL (your favorite languages)
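The on-demand pricing model above (queries at $5/TB as of October 2016, paying only for the columns a query touches) can be sketched with a small cost estimator. This is a minimal illustration, not the BigQuery API; the table and column sizes are invented for the example.

```python
# Rough query-cost estimator for BigQuery-style on-demand pricing.
# Assumes the October 2016 rate of $5 per TB scanned; because storage is
# columnar, only the columns referenced by the query contribute to the scan.

PRICE_PER_TB = 5.0          # USD per TB of data scanned (Oct 2016 rate)
TB = 2 ** 40                # bytes per tebibyte

def estimate_query_cost(column_sizes_bytes, referenced_columns):
    """Sum the sizes of only the referenced columns and price the scan."""
    scanned = sum(column_sizes_bytes[c] for c in referenced_columns)
    return scanned / TB * PRICE_PER_TB

# Hypothetical table: total stored bytes per column.
columns = {"user_id": 8 * 10**9, "event": 40 * 10**9, "payload": 500 * 10**9}

# A query touching only user_id and event scans ~48 GB, not the ~548 GB table.
cost = estimate_query_cost(columns, ["user_id", "event"])
print(f"${cost:.4f}")
```

This is why dropping an unused wide column from a query (here, `payload`) cuts the bill by an order of magnitude.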
Our benefits
no provisioning/deploy
no running out of resources
no more focus on large scale execution plan
no need to re-implement tricky concepts
(time windows / join streams)
pay only for the columns referenced in your queries
run raw ad-hoc queries (whether by analysts/sales or devs)
no more throwing away, expiring, or aggregating old data.
The 'macro view' on Big Query:
We started with an overview, some typical uses and moved to project hierarchy, access control and security.
At the end we touch on tools and demos.
BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.
Abhishek Somani and Adesh Rao from Qubole share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.
Strata London May 2018
How Data Virtualization Adds Value to Your Data Science Stack, by Denodo
Watch here: https://bit.ly/3cZGCxr
For their machine learning and data science projects to be successful, data scientists need access to all of the enterprise data delivered through their myriad of data models. However, gaining access to all data, integrated into a central repository has been a challenge. Often 80% of the project time is spent on these tasks. But, a virtual layer can help the data scientist speed up some of the most tedious tasks, like data exploration and analysis. At the same time, it also integrates well with the data science ecosystem. There is no need to change tools and learn new languages. The data virtualization platform helps data scientists offload these data integration tasks, allowing them to focus on advanced analytics.
In this session, you will learn how data virtualization:
- Provides all of the enterprise data, in real-time, and without replication
- Enables data scientists to create and share multiple logical models using simple drag and drop
- Provides a catalog of all business definitions, lineage, and relationships
Big data is primarily associated with AI and new technology. It is as much a revolution in cooperation patterns, however. Big data entails the democratisation of data within an organisation, enabling agile, data-driven innovation in a manner that was previously unavailable. Knowing this, how can you work as an organisation to harvest the fruits and what can go wrong?
Webinar: Cloud Archiving – Amazon Glacier, Microsoft Azure or Something Else? by Storage Switzerland
Amazon Glacier seems like the ultimate archive; capacity costs are impressively cheap, it never has to be upgraded or replaced and all the data is off-site. The problem is at some point the organization is going to need to recover some specific set of data from that archive. At that point, the allure of Glacier wears off quickly because it is slow and recovery has to be done either through the command line or within software code. There is no self-service web-portal access to identify and retrieve specific content easily.
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI..., by Alluxio, Inc.
Alluxio Monthly Webinar
Nov. 15, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer)
- Beinan Wang (Senior Staff Engineer & Architect)
Many companies are working with development architectures for AI platforms but have concerns about efficiency at scale as data volumes increase. They use centralized cloud data lakes, like S3, to store training data for AI platforms. However, GPU shortages add more complications. Storage and compute can be separate, or even remote, making data loading slow and expensive:
1) Optimizing a developmental setup can include manual copies, which are slow and error-prone
2) Directly transferring data across regions or from cloud to on-premises can incur expensive egress fees
This webinar covers solutions to improve data loading for model training. You will learn:
- The data loading challenges with distributed infrastructure
- Typical solutions, including NFS/NAS on object storage, and why they are not the best options
- Common architectures that can improve data loading and cost efficiency
- Using Alluxio to accelerate model training and reduce costs
ADV Slides: Building and Growing Organizational Analytics with Data Lakes, by DATAVERSITY
Data lakes are providing immense value to organizations embracing data science.
In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
In this on demand webinar, join Storage Switzerland and Cloudian as we describe three ways cloud storage and on-premises storage can complement each other.
Learn how the new world of multi-cloud storage coupled with scalable on-premises storage can:
* Achieve limitless scalability and seamless capacity expansion
* Enable unified data management across clouds and on-prem
* Consolidate object and file data to a single storage environment
* Reduce costs and eliminate complexity
Webinar: Data Protection for Kubernetes, by MayaData Inc
In this webinar, we will back-up many live workloads to the Cloudian Hyperstore from a Kubernetes environment running on a particular cloud. We will demonstrate the value of Cloudian’s WORM capabilities to show how workloads and their data can be protected from ransomware attacks. Later, we will recover workloads from the Cloudian HyperStore to another cloud vendor. We will also demonstrate streaming back-ups for use in cloud and hardware switch overs and other use cases.
Kubera from MayaData is the first solution to extend the per workload management of data offered by Container Attached Storage to back-ups and disaster recovery. Kubera is often used by small teams to establish and manage back-up policies whereby data is backed up to S3 compatible object storage. Kubera can also be used to provide a comprehensive view across all workloads of back-up and retention policies and to enable back-ground cloud migration and disaster recovery.
Unlock Your Data for ML & AI using Data Virtualization, by Denodo
How Denodo Complements a Logical Data Lake in the Cloud
● Denodo does not substitute data warehouses, data lakes, ETLs...
● Denodo enables the use of all of them together, plus other data sources
○ In a logical data warehouse
○ In a logical data lake
○ They are very similar; the only difference is in the main objective
● There are also use cases where Denodo can be used as a data source in an ETL flow
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture, by Denodo
Watch full webinar here: https://bit.ly/3gSmtQY
Data lakes have been both praised and loathed. They can be incredibly useful to an organization, but they can also be the source of major headaches. Their ease of scaling storage at minimal cost has opened the door to many new solutions, but also to a proliferation of runaway objects that have coined the term data swamp.
However, the addition of an MPP engine, based on Presto, to Denodo’s logical layer can change the way you think about the role of the data lake in your overall data strategy.
Watch on-demand this session to learn:
- The new MPP capabilities that Denodo includes
- How to use them to your advantage to improve security and governance of your lake
- New scenarios and solutions where your data fabric strategy can evolve
Object storage promises many things: unlimited scalability, both in terms of capacity and file count; low-cost but highly redundant capacity; and excellent connectivity to legacy NAS. But despite these promises, object storage has not caught on in the enterprise like it has in the cloud. It seems that, for the enterprise, object storage just isn't a good fit. The problem is that most object storage systems' starting capacity is too large. And while connectivity to legacy NAS systems is available, seamless integration is not. Can object storage be sized so that it is a better fit for the enterprise?
Unstructured data is growing at a staggering rate. It is breaking traditional storage and IT budgets and burying IT professionals under a mountain of operational challenges. Listen as Cloudian and Storage Switzerland discuss, panel style, the seven key reasons why organizations can dramatically lower storage infrastructure costs by deploying a hardware-agnostic object storage solution instead of sticking with legacy NAS.
This is the supporting presentation of a hands-on session on cloud storage as a light introduction to cloud adoption.
Cloud storage is an easy-to-understand, easy-to-use, and easy-to-start journey for the adoption of public cloud services.
On top of that, it enables and supports many different use cases where data is key.
Got data?… now what? An introduction to modern data platforms, by JamesAnderson599331
What are Data Analytics Platforms? What decision points are necessary in creating a modern, unified analytics data platform? What benefits are there to building your analytics data platform on Google Cloud Platform? Susan Pierce walks us through it all.
Social networks play a fundamental role as a medium for the spread of influence among their members. As part of this research, estimates of influence between individuals were presented.
Lykaio Wang (https://www.linkedin.com/in/lykaiowang/) discusses visualizing data in a web browser environment, first covering a few popular data visualization libraries and then diving further into the pros and cons of D3. Lykaio shows you why D3 is so powerful and how you can leverage it to visualize tabular, graph, geospatial, or any other type of data.
Daria Voronova - The Art of Telling a Story, by Zia Babar
Daria Voronova (https://www.linkedin.com/in/daria-voronova-76b724b5/) takes us on a journey through the stories that can be uncovered using Tableau as a discovery tool. In this presentation, she describes what storytelling is and why it is so important in enterprise contexts, followed by how to build a storytelling dashboard in Tableau.
Waterloo Data Science and Data Engineering Meetup - 2018-08-29, by Zia Babar
Presentation given by Atif Khan, VP AI and Data Science, Messagepoint at the August meetup event of the Waterloo Data Science and Data Engineering group.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
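The automated data validation idea under point 4 can be sketched in a few lines: rules are checked at the source, and records that fail are separated out before they cause downstream issues. The field names and rules below are hypothetical illustrations, not part of the original text.

```python
# Minimal sketch of automated data-quality checks run at the source.
# Each rule is a named predicate; failing records are collected with the
# names of the rules they broke, for root-cause analysis.

def validate(records, rules):
    """Return (valid, errors); errors pairs each bad record with its failed rules."""
    valid, errors = [], []
    for rec in records:
        failed = [name for name, check in rules.items() if not check(rec)]
        if failed:
            errors.append((rec, failed))
        else:
            valid.append(rec)
    return valid, errors

# Hypothetical rules for an orders feed.
rules = {
    "amount_positive": lambda r: r.get("amount", 0) > 0,
    "has_customer_id": lambda r: bool(r.get("customer_id")),
}

records = [
    {"customer_id": "c1", "amount": 19.99},
    {"customer_id": "",   "amount": -5},   # fails both rules
]
valid, errors = validate(records, rules)
print(len(valid), len(errors))  # 1 1
```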
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated afterwards. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
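As one concrete piece of the above, a power-iteration PageRank that skips recomputation of already-converged vertices might look like the following. This is a minimal sketch of just that one optimization, assuming a small in-memory graph with no dangling nodes; it is not the STICD implementation itself.

```python
# Power-iteration PageRank that stops recomputing vertices whose ranks
# have already converged -- one of the per-iteration work reductions
# described above. Graph is given as adjacency lists of out-links.

def pagerank(out_links, d=0.85, tol=1e-10, max_iter=100):
    n = len(out_links)
    # Invert the graph once so each vertex knows its in-links.
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    rank = {v: 1.0 / n for v in out_links}
    converged = set()
    for _ in range(max_iter):
        if len(converged) == n:
            break  # every vertex has settled
        for v in out_links:
            if v in converged:
                continue  # skip work for settled vertices
            new = (1 - d) / n + d * sum(
                rank[u] / len(out_links[u]) for u in in_links[v])
            if abs(new - rank[v]) < tol:
                converged.add(v)
            rank[v] = new
    return rank

# 3-cycle: by symmetry every vertex should end up with rank 1/3.
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

Real implementations track convergence per partition rather than per vertex, but the saving is the same: settled parts of the graph drop out of the iteration loop.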
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2..., by pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage ever expands as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
1. Cloud-Based Data
Storages and Databases
for Data Science
Waterloo Data Science and Data Engineering Meetup
Zia Babar
LinkedIn: https://www.linkedin.com/in/zbabar
Twitter: https://twitter.com/ziababar
July 2018
3. Zia Babar @ Waterloo Data Science and Data Engineering Meetup, July 2018
In Part 1 of this Series...
4.
Cloud Database
● SQL Databases
● NoSQL Databases
○ Key-Value
○ Column
○ Document
○ Graph
● Cloud-Based (on Azure)
Agenda for Part 2
Cloud Storage
● File Storage
● Block Storage
● Object Storage
● Cloud-Based
○ On Microsoft Azure
○ On Google Cloud
5.
● Differences between
○ Data Lakes
○ Databases
○ Data Warehouses
● Enterprise Data Integration
What will be covered in Part 3...
● How to handle various data
types…
○ Streaming data (e.g. IoT)
○ Batch data
○ Event data
○ Log Data
6.
About the Presenter
Zia has 19 years of professional industry experience, with the most recent 8 years
being in technical leadership roles, where he led various engineering teams
pertaining to the design, development and deployment of enterprise applications
with a particular focus on incorporating machine learning practices and cognitive
services into software applications.
Presently Zia is finishing up his PhD at the University of Toronto with particular
research interests on designing enterprise cognitive systems.
8.
● Many companies require a centralized, easily accessible way to store files
and folders.
● File-level storage provides a traditional and simple approach to data storage
at low cost.
● Files are given a name, tagged with metadata, and organized in folders
under directories and sub-directories.
● A standard naming convention is used, which makes files easy to organize.
● Storage technologies such as NAS (Network Attached Storage) allow for
sharing at the local system level.
File Storage
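The hierarchical model described above can be sketched with ordinary filesystem calls; the directory and file names below are invented for illustration:

```python
import tempfile
from pathlib import Path

# Hypothetical hierarchy: files named by convention and grouped under
# directories and sub-directories, as on a shared NAS volume.
root = Path(tempfile.mkdtemp())
(root / "reports" / "2018").mkdir(parents=True)
(root / "reports" / "2018" / "q3_sales.csv").write_text("region,amount\n")

# Navigation and search are path-based: walk the tree to locate files.
found = [p.name for p in root.rglob("*.csv")]
# found == ["q3_sales.csv"]
```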
9.
File Storage - Advantages
● Traditional and simple approach.
● Hierarchical system that excels at handling relatively small amounts of
data.
● Low data storage cost and complexity.
10.
● Navigating through a large number of files is time-consuming.
● Searching for data is problematic.
● File based operations (backup, restore, etc.) take much longer.
File Storage - Disadvantages
11.
● Sharing files is simple and effective.
● Scalability can be quickly achieved using scale-out NAS solutions, at a low
cost for archiving files.
● Deployment is easily attained. Porting over data is simple.
● Support for standard protocols and encryption, native replication, and
various drive technologies ensures protection of data.
File Storage Use Cases
12.
● A raw storage volume in which data is split into chunks (blocks) of equal
size.
● A server-based OS manages these volumes and uses them as individual
hard drives to perform native OS functions.
● Typically deployed as SAN (Storage Area Network).
Block Storage
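A minimal sketch of the block model, assuming an illustrative 4-byte block size (real volumes use e.g. 512 B or 4 KiB blocks):

```python
# Sketch of the block model: a raw volume is an array of fixed-size
# blocks, addressed only by number (no names or metadata).
BLOCK_SIZE = 4  # illustrative; real volumes use much larger blocks

def to_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into equal-size blocks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]

volume = to_blocks(b"hello world")
assert len(volume) == 3 and all(len(b) == BLOCK_SIZE for b in volume)
# The controlling OS reads block 1 directly by address, not by filename:
assert volume[1] == b"o wo"
```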
13.
Block Storage - Advantages
● Unlike in file-based architectures, there is no additional metadata
associated with a block beyond its address.
● The controlling OS manages data block storage by allocating storage for
different applications and deciding where data goes in the block.
● This results in high performance with large amounts of data transfer
possible.
14.
● Storage is tied to one server at a time.
● Limited metadata about the information being stored.
● Cost is calculated on block storage allocated, and not block storage used.
● Accessibility only through a running server.
● Requires more hands-on setup than object storage (e.g. filesystem
choices, permissions, versioning, backups, etc.).
Block Storage - Disadvantages
15.
Block Storage Use Cases
● Databases and other mission-critical applications that demand
consistently high performance.
● Multiple data disks configured in a RAID array to bolster data protection
and performance.
● Virtualization software vendors use block storage as file systems for the
guest OS.
16.
● Object-based storage stores data in isolated containers known as objects.
● Objects are given unique identifiers and stored in a flat namespace.
● Objects are retrieved by their unique IDs, typically through REST APIs.
● Allows much greater metadata flexibility: customized metadata can be
paired with objects for specific applications.
Object Storage
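A toy in-memory sketch of the object model: a flat namespace where each object is an opaque blob plus custom metadata, addressed by a unique ID (real systems expose this over REST APIs). All names and values here are invented for illustration:

```python
import uuid

# Flat namespace: no folders, just IDs mapping to blobs + metadata.
store: dict[str, dict] = {}

def put_object(data: bytes, **metadata) -> str:
    object_id = str(uuid.uuid4())                    # unique identifier
    store[object_id] = {"data": data, "meta": metadata}
    return object_id

def get_object(object_id: str) -> bytes:
    return store[object_id]["data"]

oid = put_object(b"<jpeg bytes>", content_type="image/jpeg", camera="X100")
assert get_object(oid) == b"<jpeg bytes>"
assert store[oid]["meta"]["camera"] == "X100"        # app-specific metadata
```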
17.
● Object storage offers extensive customization possibilities.
● Object-based operations are possible; for example, moving objects to
different areas of storage, or deleting objects when no longer needed.
● Scaling out an object architecture is as simple as adding additional nodes
to the storage cluster due to location transparency.
● REST based authentication and authorization approaches can be applied
here as well.
Object Storage - Advantages
18.
● Generally, object storage offers far more manageability, flexibility, and
scalability than file- and block-level systems; however, this often comes
at the expense of performance.
Object Storage - Disadvantages
19.
● Ability to accommodate unstructured data with relative ease, thus
serving organizations' big data needs well.
● API based and object based access makes for simplified integration and
usage in web applications.
● Optimal for the massive amounts of data that typically accompany
archived backups.
Object Storage Use Cases
20.
Storage Types on Microsoft Azure
Source: http://microsoftgeek.com/?p=2444
21.
Azure Storage
Source: https://www.cloudberrylab.com/blog/microsoft-azure-storage-types-explained/
22.
● Microsoft Azure Table Storage
○ Designed to store structured NoSQL data in tables, with scalable and
inexpensive storage.
○ Can be used to store structured data without relying on expensive
RDBMS databases and SQL querying techniques.
Microsoft Azure Storage
Source: https://www.infragistics.com/community/blogs/b/mihail_mateev/posts/how-to-manage-microsoft-azure-table-storage-with-node-js
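A sketch of Table Storage's entity model without the Azure SDK, just to show the idea: each entity lives under a (PartitionKey, RowKey) pair, and different entities may carry different properties. Keys and values below are invented for illustration:

```python
# Entities are addressed by (PartitionKey, RowKey); there is no fixed
# schema shared across entities.
table: dict[tuple[str, str], dict] = {}

table[("Toronto", "user001")] = {"name": "Ada", "age": 36}
table[("Toronto", "user002")] = {"name": "Alan"}   # different properties: fine

# Point lookups by (PartitionKey, RowKey) are the cheap, indexed path.
entity = table[("Toronto", "user001")]
assert entity["name"] == "Ada"
```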
23.
● Microsoft Azure Blob Storage
○ BLOB = Binary Large OBject
○ Object storage solution for the cloud
○ Optimized for storing massive amounts of unstructured data, such as text or
binary data.
Microsoft Azure Storage
Source: https://www.qnap.com/en-uk/how-to/tutorial/article/backup-qnap-nas-data-to-microsoft-azure-storage
24.
● Microsoft Azure Queue Storage
○ Type of storage designed to connect multiple decoupled and independent
application components.
○ Allows for stateless applications and asynchronous message queuing.
Microsoft Azure Storage
Source: https://newhelptech.wordpress.com/2017/11/10/step-by-step-how-to-create-and-configure-azure-queue-storage-using-visual-studio-in-microsoft-azure/
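The decoupling idea can be sketched with Python's standard-library queue, standing in here for Azure Queue Storage between two hypothetical components:

```python
import queue

# The producer and consumer never call each other directly; they only
# share a queue, so either side can scale or fail independently.
q: "queue.Queue[str]" = queue.Queue()

def web_frontend(order_id: str) -> None:
    q.put(order_id)                 # enqueue work and return immediately

def order_processor() -> list[str]:
    processed = []
    while not q.empty():
        processed.append(q.get())   # consume asynchronously, in FIFO order
    return processed

web_frontend("order-1")
web_frontend("order-2")
assert order_processor() == ["order-1", "order-2"]
```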
25.
● Microsoft Azure Disk Storage
○ A service that allows you to create disks for your virtual machines.
○ The disk can be accessed from only one virtual machine as a local drive.
● Microsoft Azure File Storage
○ A network-based storage share that allows files to be accessed from
different virtual machines.
Microsoft Azure Storage
26.
Storage Types on Google Cloud Platform
Source: Google
27.
Choosing a storage option
Source: https://cloud.google.com/storage-options/
28.
Storage Classes
Source: https://cloudplatform.googleblog.com/2016/10/introducing-Coldline-and-a-unified-platform-for-data-storage.html
30.
SQL Databases
● One type of structure (relational)
● Developed in the 1970s to deal with first wave of data
storage applications
● Structure and data types are fixed in advance
● Uses SQL-based querying through keywords such as SELECT,
INSERT, UPDATE, etc.
● Examples: MySQL, Postgres, Oracle, SQL Server, etc.
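A minimal illustration using SQLite (standing in here for any relational engine): the schema is fixed up front, and all access goes through SQL keywords such as SELECT, INSERT, and UPDATE:

```python
import sqlite3

# Structure and data types are declared in advance (fixed schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Ada", "Waterloo"))
con.execute("UPDATE users SET city = ? WHERE name = ?", ("Toronto", "Ada"))

rows = con.execute("SELECT name, city FROM users").fetchall()
assert rows == [("Ada", "Toronto")]
```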
31.
NoSQL Databases
● Many different types of databases (key-value, document,
wide-column, and graph)
● Developed in 2000s to deal with limitations of SQL DBs
(concerning scale, replication, unstructured data)
● Dynamic schemas; records can add new fields on the fly.
● Access is through object-oriented APIs.
● Examples: MongoDB, Cassandra, Neo4j, Redis Cache, etc.
32.
SQL vs. NoSQL
Source: https://twitter.com/aricitak/status/781051974011813888
33.
Source: Wikipedia
Key-Value Pair
● Hold a single serialized object for each key value.
● Good for storing large volumes of data where you want to get one item
for a given key value.
● No need to query based on other properties of the item.
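A toy sketch of the key-value pattern: one serialized object per key, fetched only by that key. The session key and fields below are invented for illustration:

```python
import json

# The value is an opaque serialized blob; the store never looks inside
# it, so there are no queries on the object's inner fields.
kv: dict[str, str] = {}

def put(key: str, obj: dict) -> None:
    kv[key] = json.dumps(obj)

def get(key: str) -> dict:
    return json.loads(kv[key])

put("session:42", {"user": "ada", "cart": ["book", "pen"]})
assert get("session:42")["cart"] == ["book", "pen"]
```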
34.
Column
● Key/value data stores that structure data storage into collections of
related columns called column families.
● Each column family is stored in a separate partition, with the same row key.
● An application can read a single column family without reading through
all of the data for an entity.
Source: https://database.guide/what-is-a-column-store-database/
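A toy sketch of the column-family layout: the same row key appears in each family, and one family can be read without touching the others (family and column names are illustrative):

```python
# Two column families for the same entity, stored separately; a read of
# one family never loads the columns of the other.
identity = {"user42": {"name": "Ada", "email": "ada@example.com"}}
activity = {"user42": {"last_login": "2018-07-01", "visits": 17}}

# Reading only the "activity" family: no identity columns are touched.
assert activity["user42"]["visits"] == 17
```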
35.
Document Databases
● Key/value DBs in which the values are documents; a "document" is a
collection of named fields and values.
● Data typically stored as XML, YAML, JSON, or plain text.
● Can query on non-key fields and define secondary indexes for faster
querying.
● Suitable for applications that retrieve data based on complex criteria
(beyond the value of the document key)
Source: https://lennilobel.wordpress.com/2015/06/01/relational-databases-vs-nosql-document-databases/
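A toy sketch of querying on a non-key field via a secondary index, which is the capability document databases provide natively (documents and fields are invented):

```python
from collections import defaultdict

# Documents are fetched by key, but a secondary index on a non-key
# field ("city") lets us query by criteria beyond the document key.
docs = {
    "d1": {"name": "Ada", "city": "Waterloo", "tags": ["ml"]},
    "d2": {"name": "Alan", "city": "Toronto"},     # schemas may differ
    "d3": {"name": "Grace", "city": "Waterloo"},
}

index_by_city = defaultdict(list)
for doc_id, doc in docs.items():
    index_by_city[doc["city"]].append(doc_id)

assert sorted(index_by_city["Waterloo"]) == ["d1", "d3"]
```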
36.
Index comparison between MongoDB and SQL Server
Source: http://sql-vs-nosql.blogspot.com/2013/11/indexes-comparison-mongodb-vs-mssqlserver.html
37.
● Data schema consists of nodes, edges and properties to model the
relationship between objects.
● Can efficiently perform queries that traverse the network of objects and
the relationships between them.
Graph Databases
Source: Wikipedia
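A toy sketch of the graph model: nodes with labeled edges, and a traversal that follows relationships directly rather than joining tables (nodes and edge labels are illustrative):

```python
from collections import deque

# Adjacency lists: each node maps to its outgoing (label, target) edges.
edges = {
    "alice": [("FOLLOWS", "bob")],
    "bob":   [("FOLLOWS", "carol")],
    "carol": [],
}

def reachable(start: str) -> set[str]:
    """All nodes reachable from start by following edges (BFS)."""
    seen, frontier = set(), deque([start])
    while frontier:
        node = frontier.popleft()
        for _label, target in edges[node]:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen

assert reachable("alice") == {"bob", "carol"}
```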
38.
Azure NoSQL Databases
Source: https://docs.microsoft.com/en-us/aspnet/aspnet/overview/developing-apps-with-windows-azure/building-real-world-cloud-apps-with-windows-azure/data-storage-options
39.
Google Cloud Database Portfolio
Source: https://twitter.com/gregsramblings/status/839667109634293760