Presentation on the use cases of NoSQL in media. June 27th, 2014.
Covering:
- key-value
- column
- document stores
- map/reduce
- graph
- search
- blob storage
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
Our way from Drupal 6 to Thunder - Contentpool for PublishersOliverBerndt
This document outlines the transition of a publishing company from Drupal 6 to the Thunder content management system. It discusses how the company used Drupal 6 initially and evaluated different CMS options. It created a "Contentpool 1" solution in Drupal 7 to share content across sites. To improve on this, it is developing a "Contentpool 2" solution using the Thunder CMS for its modular architecture and ability to easily reuse and distribute content across subscribed sites. The document provides details on the requirements, architecture, and advantages of the new Thunder/Contentpool system over the previous Drupal implementation.
Prague data management meetup #30 2019-10-04Martin Bém
This document summarizes the agenda for the Prague Data Management Meetup on April 10, 2019. The meetup will feature a presentation from Jeff Pollock on next generation data integration patterns. The meetup series discusses topics related to data management, acquisition, storage, integration, analytics, and usage. It is an open professional group that has been running since 2015.
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
Data lakes are providing immense value to organizations embracing data science.
In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
How Hewlett Packard Enterprise Gets Real with IoT AnalyticsArcadia Data
Learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale.
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Managing Data with Voume Velocity, and Variety with Amazon ElastiCache for RedisAmazon Web Services
Learn how to use Amazon ElastiCache with AWS IoT and AWS Lambda to create serverless solutions that let you rapidly make use of large and multisource data sets.
Managing Data with Amazon ElastiCache for Redis - August 2016 Monthly Webinar...Amazon Web Services
Many data sets, such as time-series collections or Internet of Things (IoT) deployments can include huge numbers of sensor reports and other data points, which can be a challenge to manage and aggregate. Amazon ElastiCache for Redis provides an on-demand managed service with the performance and scalability to turn big data into useful information. Join us to learn how to use Amazon ElastiCache to create serverless solutions that lets you rapidly make use of large and multisource data sets.
Learning Objectives:
• Learn how to ingest and analyze sensor data using Amazon ElastiCache for Redis and the AWS IoT Service
• Learn how to use ElastiCache Redis for Time-Series data
This document provides an overview of a workshop on cloud big data architectures. The workshop covers:
1. Different types of big data solutions and when to use each, such as Hadoop, NoSQL and big relational databases.
2. Data pipelines, including ETL tools, load testing patterns and connecting clouds.
3. Querying and visualizing data through business analytics, predictive analytics and visualization tools.
4. A brief introduction to IoT and how it relates to big data.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: Solve big-data problems with Hadoop. Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data. Implement Hadoop using HDS Hadoop reference architecture. For more information on Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC first architecture.
Things we’ll discuss decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end to end walk through of our processes that allow non-programmers to write and deploy event-driven data flows.
– Show end to end the usage of dynamic event processing that creates other stream processes, via a dynamic control plane topology pattern and broadcasted state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
The document discusses data warehousing and the Data Warehouse Network. It provides an overview of the Data Warehouse Network as Europe's premier data warehousing consultancy and membership organization. It then covers various aspects of data warehousing including the differences between operational and data warehouse environments, the conceptual architecture of a data warehouse, and the evolutionary process of planning, building, and managing a data warehouse over time.
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
Developing Enterprise Consciousness: Building Modern Open Data PlatformsScyllaDB
ScyllaDB, along side some of the other major distributed real-time technologies gives businesses a unique opportunity to achieve enterprise consciousness - a business platform that delivers data to the people that need when they need it any time, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications using open source tools and technologies and more modern low-code ETL/ReverseETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for Big Companies
- What can ScyllaDB do for smaller companies.
Data Culture Series - Keynote & Panel - 19h May - LondonJonathan Woodward
Big data. Small data. All data. You have access to an ever-expanding volume of data inside the walls of your business and out across the web. The potential in data is endless – from predicting election results to preventing the spread of epidemics. But how can you use it to your advantage to help move your business forward?
Data is growing exponentially and it’s now possible to mine and unlock insights from data in new and unexpected ways. Empower your business to take advantage of this data by harnessing the rich capabilities of Microsoft SQL Server and the familiarity of Microsoft Office to help organize, analyze, and make sense of your data—no matter the size.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
What do a Lego brick and the XZ backdoor have in common?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to dive into a story of interoperability, standards and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations and training efforts. She previously worked on LibreOffice migrations and training courses for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not following her passion for computers and for Geeko she cultivates her curiosity about astronomy (which is where her nickname deneb_alpha comes from).
2. About me
Manager Core Services at Sanoma
Responsible for all common services, including the Big Data platform
Work:
– Centralized services
– Data platform
– Search
Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry Pi, soldering stuff
3. Sanoma, a B2C publishing and learning company
2 Finnish newspapers
Over 100 magazines
5 TV channels in Finland and the Netherlands
200+ websites
100 mobile applications on various mobile platforms
7. Data models
Speed
Scalability
Partition tolerance
Availability / Redundancy
Cost per GB
Specialized focus
8. CAP Theorem
The CAP (or Brewer's) theorem says: “it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
– Consistency
– Availability
– Partition tolerance”
[Diagram: triangle with corners C (Consistency), A (Availability) and P (Partition tolerance)]
9. CAP Theorem
Availability – each client can always read and write
Partition tolerance – the system works well despite physical network partitions
Consistency – all clients always have the same view of the data
[Diagram: CAP triangle. RDBMS products (MySQL, Postgres, MS SQL, Oracle) sit on the Consistency–Availability side; the other two sides are labelled NOSQL]
14. Key/value stores
Stores objects by key
Based on the Dynamo paper (Werner Vogels)
Products:
– Riak
– Memcache/Membase
– Tokyo Cabinet
– Redis
– Voldemort
Use cases (see the Redis sketch below):
– Counting
– Top lists
– Caches
– Pre-calculated optimizations
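Counting, top lists and caching map naturally onto key-value operations. Below is a minimal sketch using Redis via the redis-py client (version 3+ assumed); the key names are illustrative, not taken from the slides.

```python
# A minimal key-value sketch; assumes a local Redis and redis-py >= 3.
import redis

r = redis.Redis(host="localhost", port=6379)

# Counting: one counter per article, plus a 5-minute bucket for "trending now".
r.incr("article:100:views")
r.incr("article:100:views:2015-04-24T10:05")

# Top lists: a sorted set keeps the most-viewed articles ordered by score.
r.zincrby("toplist:articles", 1, "article:100")
print(r.zrevrange("toplist:articles", 0, 9, withscores=True))

# Caches: store a pre-rendered fragment with a 5-minute TTL.
r.setex("cache:frontpage", 300, "<html>...</html>")
```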
15. Key/Value buckets
Bucket                | A    | B    | C
User                  | XXXX | YYYY | ZZZZ
Article               | 100  | 200  | 300
Article_<5 min. TIME> | 50   | 100  | 150
24. Column stores
Lineage: Google's BigTable paper
Records with many, many columns
Distinguish between hot and cold data
Versioning
Records and columns can be sharded
Products:
– HBase
– Cassandra
– Hypertable
Use cases (see the Cassandra sketch below):
– Analytics
– Messages
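To make the wide-row, analytics-oriented model concrete, here is a minimal sketch using Cassandra through the DataStax cassandra-driver. The keyspace, table and column names are illustrative assumptions, not from the slides.

```python
# A minimal column-store sketch; assumes a local Cassandra node and cassandra-driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS analytics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS analytics.pageviews (
        article_id text,
        viewed_at  timeuuid,
        user_id    text,
        PRIMARY KEY (article_id, viewed_at)
    )
""")

# Wide rows: one partition per article, one cell per view, ordered by time.
session.execute(
    "INSERT INTO analytics.pageviews (article_id, viewed_at, user_id) VALUES (%s, now(), %s)",
    ("article-100", "user-42"),
)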
26. Big Data
Lineage: Google GFS & MapReduce papers
Distributed data storage and processing
Advanced analytics capabilities on raw data
Schema on read
Products:
– Hadoop
– MPP databases
Use cases (see the map/reduce sketch below):
– Ad-hoc querying of terabytes of data
– Data science
Predictive analytics
Model training
– Calculate recommendations
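A quick illustration of schema-on-read processing: a Hadoop Streaming style map/reduce job in Python that counts pageviews per article from raw tab-separated logs. The log layout and field positions are assumptions made for the example only.

```python
#!/usr/bin/env python
# A minimal Hadoop Streaming sketch; mapper and reducer are kept in one file for
# brevity, selected via the first command-line argument.
import sys
from itertools import groupby

def mapper(lines):
    # Schema on read: parse raw TSV lines, e.g. "2015-04-24\tarticle-100\tuser-42",
    # and emit (article_id, 1) for every pageview record.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")

def reducer(lines):
    # Hadoop sorts mapper output by key, so identical keys arrive grouped together.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for article_id, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{article_id}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    # e.g. hadoop jar hadoop-streaming.jar -mapper "job.py map" -reducer "job.py reduce" ...
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```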
27. Big Data at Sanoma
Main use case is reporting and analytics, moving towards data science
A/B and MVT testing evaluations
Using QlikView as a front-end
Supplies data to other environments (SAS, Advertising, Behavioral Targeting)
Agile process for adding sources: from raw to intermediate to a modelled data warehouse
Sanoma standard data platform, used in all Sanoma countries
> 250 dashboard users
40 daily users: analysts & developers
43 source systems, with 125 different sources
400 tables in Hive
Platform:
– Cloudera Hadoop
– 40-60 nodes
– > 400TB storage
– ~2,000 jobs/day
Typical data node / task tracker:
– 1-2 CPUs, 4-12 cores
– 2 system disks (RAID 1)
– 4 data disks (2TB, 3TB or 4TB)
– 24-32GB RAM
30. Search
Keyword search can be combined with advanced forms of result ranking
Most fields are stored in an index
Facets can be used for analytics
The ranker can be replaced with custom logic
Products:
– Solr
– Elasticsearch
– MarkLogic
Use cases (see the search sketch below):
– Content Search
– Analytics / Faceted
– Percolation
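A minimal content-search sketch with a facet, using the official Elasticsearch Python client (the 8.x API is assumed); index name, fields and document contents are illustrative.

```python
# A minimal full-text search plus facet (terms aggregation); assumes a local
# Elasticsearch and the 8.x elasticsearch-py client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id="100", document={
    "title": "NoSQL in media",
    "body": "Key-value, column, document, graph, search and blob storage...",
    "section": "tech",
}, refresh=True)

# Full-text query combined with a facet on the section field.
result = es.search(index="articles",
                   query={"match": {"body": "nosql"}},
                   aggs={"sections": {"terms": {"field": "section.keyword"}}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"])
```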
32. Search, too
[Diagram: content and user signals are combined (Σ) into the result ranking]
33. Search, too
[Diagram: content, page and user signals are combined (Σ) into the result ranking]
34. Search – Percolation
Traditional queries run against an index of existing data
What if the data does not exist at the time of the query?
Percolation allows queries to be registered and returns the matching query IDs, e.g. to send a notification when new matches become available
Use case (see the percolation sketch below):
– Search for a tweet, and after the initial results keep receiving newly tweeted items as they come in
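A minimal percolation sketch using the Elasticsearch percolate query (Elasticsearch 5+ and the 8.x Python client are assumed); index and field names are illustrative.

```python
# Percolation: store queries, then ask which stored queries a new document matches.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The index holds registered queries in a 'percolator' field next to the document fields.
es.indices.create(index="alerts", mappings={"properties": {
    "query": {"type": "percolator"},
    "message": {"type": "text"},
}})

# Register a stored query: "tell me about new tweets mentioning nosql".
es.index(index="alerts", id="nosql-alert",
         document={"query": {"match": {"message": "nosql"}}}, refresh=True)

# When a new tweet arrives, ask which stored queries it matches.
result = es.search(index="alerts", query={
    "percolate": {"field": "query", "document": {"message": "NoSQL at Sanoma"}}
})
print([hit["_id"] for hit in result["hits"]["hits"]])  # -> ['nosql-alert']
```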
36. Graph databases
Lineage: Euler and graph theory.
Data model: nodes & edges, both of which can hold key-value pairs
Products:
– AllegroGraph
– InfoGrid
– Neo4j
Use cases (see the graph sketch below):
– Social relationships
– Content Linking (Entity linking)
[Diagram: an example graph linking the entities Jan Smit, 3JS, Nick en Simon and Volendam to Articles 1, 2 and 3]
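A minimal content-linking sketch against Neo4j using the official Python driver, loosely mirroring the example graph above; labels, properties and credentials are illustrative assumptions.

```python
# Nodes and relationships in Neo4j; assumes a local instance and the neo4j driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Both nodes and relationships can carry key-value properties.
    session.run("""
        MERGE (a:Article {id: 1, title: 'Article 1'})
        MERGE (p:Person  {name: 'Jan Smit'})
        MERGE (p)-[:MENTIONED_IN {confidence: 0.9}]->(a)
    """)
    # Content linking: which other articles mention the same person?
    result = session.run("""
        MATCH (a:Article {id: 1})<-[:MENTIONED_IN]-(p)-[:MENTIONED_IN]->(other:Article)
        RETURN DISTINCT other.title AS title
    """)
    print([record["title"] for record in result])

driver.close()
```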
38. Blob storage
Endless storage of binary data
Stores objects larger than a single machine can hold
“Lower” price/GB compared to SAN storage
Products
– Amazon S3
– CAStor
– (Hadoop)
Use cases (see the S3 sketch below):
– Media storage
– Archiving
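A minimal media-storage sketch with Amazon S3 via boto3; the bucket name and object keys are illustrative, and AWS credentials are assumed to be configured already.

```python
# Blob storage: upload, retrieve and list media objects in S3; assumes boto3.
import boto3

s3 = boto3.client("s3")

# Media storage: upload a video master and fetch it back on demand.
s3.upload_file("promo.mp4", "sanoma-media-archive", "video/2015/promo.mp4")
s3.download_file("sanoma-media-archive", "video/2015/promo.mp4", "/tmp/promo.mp4")

# Archiving: list what is stored under a prefix (assumes the prefix is non-empty).
for obj in s3.list_objects_v2(Bucket="sanoma-media-archive", Prefix="video/2015/")["Contents"]:
    print(obj["Key"], obj["Size"])
```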
40. Summary
RDBMS systems are good enough for many problems
For specific problems, NOSQL solutions provide a targeted solution
There’s a variety of NOSQL solutions with different characteristics
NOSQL solutions will require a higher engineering effort
41. Dream NOSQL Architecture – Content Delivery
[Diagram] CMS → Document storage (MongoDB / CouchDB), Blob storage (S3 / CAStor) and Search (ElasticSearch / Solr) → Website / Mobile Application (see the document-store sketch below)
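For the document-storage leg of this architecture, here is a minimal sketch with MongoDB via pymongo; database, collection and document fields are illustrative assumptions.

```python
# Document storage for content delivery; assumes a local MongoDB and pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client.cms.articles

# The CMS writes one self-contained document per article...
articles.insert_one({
    "_id": 100,
    "title": "NoSQL in media",
    "body": "...",
    "tags": ["nosql", "architecture"],
    "images": [{"s3_key": "img/100/header.jpg", "caption": "Header"}],
})

# ...and the website / mobile app reads it back in a single round trip.
print(articles.find_one({"tags": "nosql"})["title"])
```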
42. Dream NOSQL Architecture – Analytics
[Diagram] Event collection → Message queue (Kafka / Flume) → Event processing (Storm) →
– Key-value store (Redis) → real-time recommendations / targeting
– Column storage (Cassandra / HBase) → real-time dashboarding
– Big Data (Hadoop) → ad-hoc reporting & data science
(see the Kafka sketch below)
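A minimal event-collection sketch for the front of this pipeline, using the kafka-python package; the topic name and payload are illustrative, and the consumer merely stands in for a stream processor such as Storm.

```python
# Event collection through a message queue; assumes a local Kafka broker and kafka-python.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("pageviews", {"article_id": "article-100", "user_id": "user-42"})
producer.flush()

# A downstream processor (standing in for Storm) consumes the same topic and could
# update Redis counters, Cassandra tables and HDFS in parallel.
consumer = KafkaConsumer("pageviews", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         value_deserializer=lambda b: json.loads(b.decode("utf-8")))
for message in consumer:
    print(message.value["article_id"])
    break
```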
43. CAP Theorem
Availability – each client can always read and write
Partition tolerance – the system works well despite physical network partitions
Consistency – all clients always have the same view of the data
[Diagram: the CAP triangle again, now with products placed on each side]
– Consistency + Availability: MySQL, Postgres, MS SQL, Oracle, Asterdata, Greenplum, Vertica
– Availability + Partition tolerance: Dynamo, Voldemort, Tokyo Cabinet, KAI, Cassandra, SimpleDB, CouchDB, Riak
– Consistency + Partition tolerance: BigTable, Hypertable, HBase, MongoDB, Terrastore, Scalaris, Berkeley DB, MemcacheDB, Redis
Data models
Relational databases
Key-value
Column-oriented
Document-oriented