Speech given by Monica Franceschini, Solution Architecture Manager at the Big Data Competency Center of Engineering Group, on the occasion of the Data Driven Innovation Rome 2016 - Open Summit.
HUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist (SpagoWorld)
The presentation supported the speech "Drilling into Data with Apache Drill" by Tugdual Grall (Technical Evangelist, MapR Technologies Inc.) at the HUG Italy meet-up supported by Engineering Group's SpagoBI Labs, which took place in Milan, Italy on March 17th, 2016. Read more: http://bit.ly/1UydNuz
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions Architect (SpagoWorld)
The presentation supported the speech "Think differently – Stream-based Microservice Architecture for Next-Generation Applications" by Fabian Wilckens (EMEA Solutions Architect, MapR Technologies Inc.) at the HUG Italy meet-up supported by Engineering Group's SpagoBI Labs, which took place in Milan, Italy on March 17th, 2016. Read more: http://bit.ly/1UydNuz
From the Hadoop Summit 2015 Session with Tomer Shiran.
To deliver real-time impact from big data, organizations must evolve beyond traditional analytic approaches to support a new class of agile, distributed applications. Real-time Hadoop overcomes batch programs reliant on data transformations and schema management. This session highlights how leading organizations are leveraging Hadoop and NoSQL to merge analytics and production data to make adjustments while business is happening to optimize revenue, mitigate risk and reduce operational costs. Details include how companies have achieved real-time impact on their business, collapsed data silos, and automated in-line analytics with operational data for immediate impact.
The document discusses the MapR Big Data platform and Apache Drill. It provides an overview of MapR's M7 which makes HBase enterprise-grade by eliminating compactions and enabling a unified namespace. It also describes Apache Drill, an interactive query engine inspired by Google's Dremel that supports ad-hoc queries across different data sources at scale through its logical and physical query planning. The document demonstrates simple queries and provides details on contributing to and using Apache Drill.
This document discusses using the MapR Converged Data Platform for machine learning projects. It describes MapR features like the MapR filesystem, snapshots, mirrors and topologies that help support different phases of machine learning like data collection, preparation, modeling, evaluation and deployment. The document also outlines how MapR can help manage machine learning projects at scale in an enterprise environment and integrates with common ML tools. It concludes with a demo of running H2O on MapR to showcase these features in action.
Want to discover how you can get self-service data exploration capabilities on data stored in multiple formats in files or NoSQL databases? Watch this session of Free Code Fridays to get a basic understanding of Apache Drill.
Drill is an open source, low-latency query engine for Hadoop that delivers secure, interactive SQL analytics at petabyte scale. With the ability to discover schemas on-the-fly, you can get faster time-to-value without waiting for IT to prepare the data for analysis. By adhering to ANSI SQL standards, Drill does not require a learning curve and integrates seamlessly with visualization tools.
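Drill's on-the-fly schema discovery can be illustrated with a small sketch (Python here, not Drill's actual Java internals, and the `infer_schema` helper is hypothetical): each record contributes its observed fields and types, and their union becomes the working schema at read time rather than being declared up front.

```python
import json

def infer_schema(records):
    """Merge the fields and types observed across JSON records into
    one schema, the way a schema-on-read engine discovers structure
    at query time instead of requiring a DDL beforehand."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(t) for f, t in schema.items()}

raw = ['{"name": "Ann", "age": 34}',
       '{"name": "Bo", "city": "Nice"}']
print(infer_schema([json.loads(r) for r in raw]))
# {'name': ['str'], 'age': ['int'], 'city': ['str']}
```

Note how the second record adds a field the first one lacked; a schema-on-read engine simply widens the schema instead of rejecting the data.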
Proud to be Polyglot - Riviera Dev 2015 (Tugdual Grall)
The document discusses the benefits of using multiple programming languages and data stores, or a "polyglot" approach, for modern applications. A polyglot approach allows using the right tool for each task, rather than trying to force a single technology to fit all needs. This improves performance, scalability, and the ability to adapt applications to changing requirements compared to traditional monolithic architectures. The document provides examples of when to use different languages and data stores and concludes that a polyglot approach makes applications easier to maintain over time.
Apache Drill is a scalable SQL query engine for analysis of large-scale datasets across various data sources like HDFS, HBase, Hive and others. It allows for ad-hoc analysis of datasets without requiring knowledge of the schema beforehand. Drill uses a distributed architecture with query coordinators and workers to process queries in parallel. It supports various interfaces like JDBC, ODBC and a web console for running SQL queries on different data sources.
This document discusses how real-time data analytics can enable faster insights and actions using technologies like Apache Drill, Kafka, and MapR. It provides examples of using these technologies for real-time data exploration on ingested data via NFS and Kafka streams, as well as operational data stored in HBase. Apache Drill allows flexible SQL queries over diverse data sources without schemas. When combined with low-latency streaming and MapR's distribution, this enables applications that can take immediate action based on real-time analytics.
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production (Codemotion)
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency and biometric databases are some of the examples covered in this presentation.
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
MapR 5.2: Getting More Value from the MapR Converged Community Edition (MapR Technologies)
Please join us to learn about the recent developments during the past year in the MapR Community Edition. In these slides, we will cover the following platform updates:
-Taking cluster monitoring to the next level with the Spyglass Initiative
-Real-time streaming with MapR Streams
-MapR-DB JSON document database and application development with OJAI
-Securing your data with access control expressions (ACEs)
NoSQL Application Development with JSON and MapR-DB (MapR Technologies)
NoSQL databases are being used everywhere by startups and Global 2000 companies alike for data environments that require cost-effective scaling. These environments also typically need to represent data in a more flexible way than is practical with relational databases.
Facebook - Jonathan Gray - Hadoop World 2010 (Cloudera, Inc.)
The document summarizes HBase use at Facebook, including its development and future work. HBase is used for incremental updates to data warehouses, high frequency analytics, and write-intensive workloads. Development includes Hive integration, master high availability, and random read optimizations. Future work focuses on coprocessors, intelligent load balancing, and cluster performance.
Apache Drill is a new Apache Incubator project. Its goal is to provide a distributed system for interactive analysis of large-scale datasets. Inspired by Google's Dremel technology, it aims to process trillions of records in seconds. We will cover the goals of Apache Drill, its use cases, and how it relates to Hadoop, MongoDB and other large-scale distributed systems. We'll also talk about details of the architecture, points of extensibility, data flow and our first query languages (DrQL and SQL).
Apache Spark on Apache HBase: Current and Future (HBaseCon)
- The document discusses the Spark HBase Connector, which combines Spark and HBase for fast access to key-value data. It allows running Spark and SQL queries directly on top of HBase tables.
- It provides high performance through data locality, partition pruning, and column pruning to reduce network overhead. Operations include bulk load, bulk put, bulk delete, and language-integrated queries.
- The connector achieves these improvements through the Spark Catalyst engine for query planning and optimization, and by implementing HBase as an external data source with built-in filtering capabilities.
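The partition pruning mentioned above can be sketched with a simplified Python model (not the connector's actual Scala code, and the region key ranges are invented): given each HBase region's row-key range, only the regions overlapping the query's key range need to be scanned at all.

```python
def prune_partitions(regions, lo, hi):
    """Keep only the (start_key, end_key) region ranges that overlap
    the query's [lo, hi] row-key range; skipped regions are never
    scanned, which is where the network and I/O savings come from."""
    return [(s, e) for (s, e) in regions if s <= hi and e >= lo]

# Four regions covering the key space "a".."z".
regions = [("a", "f"), ("f", "m"), ("m", "s"), ("s", "z")]
print(prune_partitions(regions, "g", "k"))
# [('f', 'm')]
```

A point lookup or narrow range thus touches one region instead of four; the real connector pushes the same reasoning into Spark's partition planning.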
MapR-DB is an enterprise-grade, high-performance, in-Hadoop NoSQL (“Not Only SQL”) database management system. It is used to add real-time, operational analytics capabilities to Hadoop, and now natively supports JSON.
The open source project Apache Drill gives you SQL-on-Hadoop, but with some big differences. The biggest difference is that Drill extends ANSI SQL from a strongly typed language into a late-binding language without losing performance. This allows Drill to process complex structured data like JSON in addition to relational data. By dynamically generating a schema at read time that matches the data types and structures observed in the data, Drill gives you both self-service agility and speed.
Drill also introduces a view-based security model that uses file system permissions to control access to data at an extremely fine-grained level that makes secure access easy to control. These extensions have huge practical impact when it comes to writing real applications.
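Since that security model delegates to ordinary file system permissions, its core check can be mimicked with Python's standard library (the temporary "view" files below are purely illustrative, not Drill's actual storage layout):

```python
import os
import stat
import tempfile

def readable_by_others(path):
    """A view file is accessible to other users only if its permission
    bits grant them read access - mirroring how a view-based security
    model can delegate access control to the file system."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)

# Create a private view and a shared view to compare.
private_view = tempfile.NamedTemporaryFile(delete=False)
os.chmod(private_view.name, 0o600)  # owner only
shared_view = tempfile.NamedTemporaryFile(delete=False)
os.chmod(shared_view.name, 0o644)   # world-readable

print(readable_by_others(private_view.name))  # False
print(readable_by_others(shared_view.name))   # True
```

Granting a group read access to one view file, but not to the raw data behind it, is what makes the fine-grained control easy to administer.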
In these slides, Tugdual Grall, Technical Evangelist at MapR, gives several practical examples of how Drill makes it easy to analyze data, using SQL in your Java application with a simple JDBC driver.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of (Charles Givre)
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self-describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive or other SQL-on-Hadoop tools, Drill is not a wrapper around MapReduce and can scale to clusters of up to 10,000 nodes.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
eBay has one of the largest Hadoop clusters in the industry with many petabytes of data. This talk will give an overview of how Hadoop and HBase have been used within eBay, the lessons we have learned from supporting large-scale production clusters, as well as how we plan to use and improve Hadoop and HBase moving forward. Specific use cases, production issues and platform improvement work will be discussed.
The document discusses MapR's distribution for Apache Hadoop. It provides an enterprise-grade and open source distribution that leverages open source components and makes targeted enhancements to make Hadoop more open and enterprise-ready. Key features include integration with other big data technologies like Accumulo, high availability, easy management at scale, and a storage architecture based on volumes to logically organize and manage data placement and policies across a Hadoop cluster.
This document provides an overview of Apache Drill and how it enables ad-hoc querying and analysis of structured and unstructured data stored in Hadoop. Some key points:
1) Apache Drill allows for schema-free SQL queries against data in HDFS, HBase, Hive and other data sources, empowering self-service data exploration and "zero-day" analytics.
2) Drill's queries can handle complex, nested data through features like automatic schema discovery, repeated value support, and SQL extensions.
3) Examples show how Drill provides a familiar SQL interface and tooling to analyze JSON, text and other file formats to gain insights from large volumes of real-time data.
Drill can query JSON data stored in various data sources like HDFS, HBase, and Hive. It allows running SQL queries over JSON data without requiring a fixed schema. The document describes how Drill enables ad-hoc querying of JSON-formatted Yelp business review data using SQL, providing insights faster than traditional approaches.
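The kind of ad-hoc question Drill answers in SQL over raw JSON, such as average stars per city, can be mimicked locally in plain Python (the sample records below are invented; real Yelp review data would be read from files in HDFS or another store):

```python
import json
from collections import defaultdict

reviews = [json.loads(line) for line in [
    '{"city": "Phoenix", "stars": 4}',
    '{"city": "Phoenix", "stars": 2}',
    '{"city": "Tempe", "stars": 5}',
]]

# Equivalent in spirit to:
#   SELECT city, AVG(stars) FROM reviews GROUP BY city;
totals = defaultdict(lambda: [0, 0])
for r in reviews:
    totals[r["city"]][0] += r["stars"]
    totals[r["city"]][1] += 1
averages = {city: s / n for city, (s, n) in totals.items()}
print(averages)
# {'Phoenix': 3.0, 'Tempe': 5.0}
```

The point of Drill is that the SQL version of this query runs directly against the JSON files, with no schema declaration and no custom aggregation code.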
This document discusses Apache Drill, an open source SQL query engine for analyzing data in non-relational data stores like JSON, CSV, and Hadoop data formats. It provides an overview of Drill's key features such as its ability to query diverse data sources with a simple SQL interface without requiring schemas, its SQL-on-Everything model, high performance through columnar storage and execution, and its ability to scale from a single machine to large clusters. The document also demonstrates how to install Drill, configure data sources, and run queries against sample Yelp data to analyze reviews, users, and businesses.
With the general availability of the MapR Converged Data Platform 5.2, we’d like to invite our customers and partners to this webinar in which members of the MapR product team will share details about this exciting new release.
A Non-Standard Use Case of Hadoop: High Scale Image Processing and Analytics (DataWorks Summit)
1. The Hadoop Image Processing (HIP) pipeline acquires vehicle images, identifies updates, generates URLs, crops and resizes images, copies them to asset servers, and removes duplicates.
2. It uses HBase for image storage and archiving, MapReduce for image processing, Kafka for publishing to asset servers, OpenCV for image processing, and Avro for data serialization.
3. Performance testing showed HIP scales linearly and is at least 10x faster than the previous system, and using cascading downloads provided a 20% performance gain.
This set of slides is part of the course Data Visualization GE, available on the FIWARE platform, of which SpagoBI is the reference implementation. The course explains how a simple analytical document can be developed from scratch.
This set of slides is part of the course Data Visualization GE, available on the FIWARE platform, of which SpagoBI is the reference implementation. This course aims to help users create a simple report with BIRT. We guide users from installation to the development of the document through SpagoBI Studio, and finally show how the report can be transferred to SpagoBI Server.
This set of slides is part of the course Data Visualization GE, available on the FIWARE platform, of which SpagoBI is the reference implementation. Here it is shown how to apply filters to a parametric BIRT report on SpagoBI Server.
This set of slides is part of the course Data Visualization GE, available on the FIWARE platform, of which SpagoBI is the reference implementation. The course gradually explains how the end user can use the SpagoBI worksheet engine to build a set of analyses with charts and tables displaying his or her own statistics.
This set of slides is part of the course Data Visualization GE, available on the FIWARE platform, of which SpagoBI is the reference implementation. This course depicts the global vision of the SpagoBI suite, the policy it follows, its usage and its main features.
This document describes SpagoBI's new data mining engine, which uses the R scripting language. The engine allows users to execute R scripts and display multiple outputs, using the JRI and Rserve libraries to interface R with Java applications. The engine works with datasets, scripts, commands, outputs, parameters and variables: scripts contain R code, datasets provide data, commands execute scripts, and outputs display results. The template defines how these components work together in a data mining document.
Openness as the Engine for Digital InnovationSpagoWorld
Gabriele Ruffatti discusses openness and digital innovation in a presentation with three main sections. Openness is seen as an engine for digital innovation, driven by complexity, dynamism and a need for trust. Open source is presented as a development model that enables collaboration, transparency and new commercial models. The role of data, people and managers who embrace innovation are key factors for digital transformation.
1. The document discusses how data quality impacts the entire lead lifecycle from prospecting to post-sales.
2. It notes that data quality can be improved through accuracy, completeness, consistency and other measures.
3. Improving data quality through best practices like data cleansing and enrichment can increase revenue by 66% and provide a high ROI according to industry research.
Webinar - What's new with SpagoBI 5: presentation and demo (SpagoWorld)
This presentation supported the webinar delivered by SpagoBI Labs within SpagoBI Webinar Center in February 2015 (in English and French). It provides an overview of the new features of SpagoBI 5 through a live presentation and demo. www.spagobi.org
Webinar - SpagoBI 5 and what-if analytics: is your business strategy effective? (SpagoWorld)
This presentation supported the webinar delivered by Alberto Ghedin, SpagoBI Architect, in February 2015 (in English). It shows the what-if analytics provided by SpagoBI 5, allowing you to simulate scenarios and predict the effects of potential changes in your business strategies. www.spagobi.org
Open Opportunity Meeting 2012: SpagoBI use cases - The open source Business I... (SpagoWorld)
Stefano Scamuzzo, SpagoBI International Manager, presented SpagoBI suite at Open Opportunity Meeting (Isola Polvese, Perugia, Italy - 19th-20th April 2012), within the track dedicated to open source communities. http://www.spagobi.org/
The presentation supported the webinar on SpagoBI Suite, delivered by Chiara Chiarelli (SpagoBI Developer) through SpagoWorld Webinar Center, on 13th December 2010. http://www.spagoworld.org/
The document introduces CALIENT Technologies' VPOD (Virtual POD) solution using their S-Series 3D-MEMS Optical Circuit Switch. The VPOD allows for compute resources to be shared between physical PODs at the optical layer, improving server utilization from typically less than 40% to over 50%. This can save tens or hundreds of millions of dollars in CAPEX for large data centers. The CALIENT LightConnect switching fabric and manager allow dynamic allocation of resources between VPODs on demand, improving efficiency without impacting performance.
This document provides an overview of the HathiTrust Research Center (HTRC) architecture. It describes the key components including a portal for access, an agent for application submission, a registry for storing metadata, a secure API for programmatic access, and storage of data in a Cassandra cluster with indexing in Solr. It also outlines use cases and discusses how the architecture enables secure, non-consumptive research on copyrighted works stored in the HathiTrust digital library.
Data-center SDN is located in St. Petersburg, Russia. It is one of the largest and most modern data centers in North-West Russia, constructed and operated in accordance with the Uptime Institute Tier III recommendations, and is PCI DSS certified.
St. Petersburg is one of the main gateways of IP connectivity between Russia and Europe. Data-center SDN provides connectivity with the largest Russian telecom operators (Beeline, Megafon, MTS) as well as with the international operators Orange and RETN. The data center has direct interconnections with all the main St. Petersburg and Moscow internet exchange points.
Owned land area: 7.5 acres
Design capacity: 1,437 racks (42-48U)
Administrative building: 1,500 sq m
Utility power supply: 10 MW, expandable to 14 MW (2 feeders, 10 kV); high-voltage distribution station, 10 kV
Up to 8 diesel-rotary UPS units, 1,600 kVA each
Load per rack: up to 40 kW
Total cooling capacity: 8.4 MW
Fuel storage: 2 × 50 m³
5 security perimeters
Estimated power usage effectiveness (PUE): 1.03-1.2
Information Technology Innovator David Ward 2011ward2dr
David Ward is an experienced senior technology executive with over 20 years of experience in leadership roles at major financial institutions. He has expertise in areas such as technology strategies, business transformation, enterprise systems, infrastructure, and mergers and acquisitions. Throughout his career, he has delivered value to shareholders and improved customer satisfaction. Currently, he is seeking new opportunities to apply his experience and drive innovation.
Presentation data center and cloud architecturexKinAnx
This document outlines an enterprise network architecture with IBM networking equipment. It includes small branch offices connected via SOHO routers, large branches with branch routers, teleworkers connecting remotely via SSL VPN, and a headquarters site connecting various IBM servers and systems. The network uses MPLS/VPLS WAN connectivity between sites with firewalls, intrusion prevention, and security management provided at each location.
3D IT Architecture is a new revolutionary way to visualize your IT architecture to your stakeholders. Build a complete 3D model of your IT landscape by putting together a series of 3D objects. Just print it, cut it and glue it together. In this presentation: Data Center
These slides will cover the essential characteristics of cloud computing in the data center. Why should you consider adopting cloud architecture? We'll show you.
Data-Ed Webinar: Data Architecture RequirementsDATAVERSITY
Data architecture is foundational to an information-based operational environment. It is your data architecture that organizes your data assets so they can be leveraged in your business strategy to create real business value. Even though this is important, not all data architectures are used effectively. This webinar describes the use of data architecture as a basic analysis method. Various uses of data architecture to inform, clarify, understand, and resolve aspects of a variety of business problems will be demonstrated. As opposed to showing how to architect data, your presenter Dr. Peter Aiken will show how to use data architecting to solve business problems. The goal is for you to be able to envision a number of uses for data architectures that will raise the perceived utility of this analysis method in the eyes of the business.
Takeaways:
Understanding how to contribute to organizational challenges beyond traditional data architecting
How to utilize data architectures in support of business strategy
Understanding foundational data architecture concepts based on the DAMA DMBOK
Data architecture guiding principles & best practices
Impala is an open-source SQL query engine for Apache Hadoop that allows for fast, interactive queries directly against data stored in HDFS and other data storage systems. It provides low-latency queries in seconds by using a custom query engine instead of MapReduce. Impala allows users to interact with data using standard SQL and business intelligence tools while leveraging existing metadata in Hadoop. It is designed to be integrated with the Hadoop ecosystem for distributed, fault-tolerant and scalable data processing and analytics.
HBaseConAsia2018 Track3-2: HBase at China TelecomMichael Stack
HBase is used at China Telecom for various applications including persistence for streaming jobs, online reading and writing, and as a data store for their core system. They operate several HBase clusters storing over 500 TB of data ingesting 1 TB per day. They monitor HBase using Ganglia for basic metrics and Zabbix for critical alerts. When issues arise, such as a system hang, they investigate debug cases and perform optimizations like changing the garbage collector from CMS to G1 and implementing read/write splitting.
This document discusses how StreamHorizon can accelerate big data analytics pipelines. It integrates seamlessly into data processing pipelines and can process data from sources like Spark, Storm, Kafka, and file systems. StreamHorizon reduces network congestion and improves query latency for tools like Impala and Hive. It is portable across platforms and provides real-time and batch processing capabilities through integrations with tools like Storm, Kafka, and Hadoop. StreamHorizon also performs data aggregations during processing to further accelerate querying and reduce network usage.
This document discusses distributed computing and Hadoop. It begins by explaining distributed computing and how it divides programs across several computers. It then introduces Hadoop, an open-source Java framework for distributed processing of large data sets across clusters of computers. Key aspects of Hadoop include its scalable distributed file system (HDFS), MapReduce programming model, and ability to reliably process petabytes of data on thousands of nodes. Common use cases and challenges of using Hadoop are also outlined.
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and SparkMichael Stack
Wei Li of Alibaba
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for their global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA, Sybase and SAP BusinessObjects enabling a broad range of new analytic applications.
This document provides an introduction to Hadoop and big data. It discusses the new kinds of large, diverse data being generated and the need for platforms like Hadoop to process and analyze this data. It describes the core components of Hadoop, including HDFS for distributed storage and MapReduce for distributed processing. It also discusses some of the common applications of Hadoop and other projects in the Hadoop ecosystem like Hive, Pig, and HBase that build on the core Hadoop framework.
xPatterns is a big data analytics platform as a service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets and a management console for building an ELT pipeline with data monitoring and quality gates; a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation; tools for exporting data to NoSQL and SolrCloud clusters for real-time access through low-latency/high-throughput APIs; as well as dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on a data warehouse of several hundred TB of medical, pharmacy and lab data, amounting to tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS stack (Spark, Shark, Mesos and Tachyon), with lessons learned and demos.
Predictive Analytics and Machine Learning…with SAS and Apache HadoopHortonworks
In this interactive webinar, we'll walk through use cases on how you can use advanced analytics like SAS Visual Statistics and In-Memory Statistic with Hortonworks’ data platform (HDP) to reveal insights in your big data and redefine how your organization solves complex problems.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Hadoop distributed computing framework for big dataCyanny LIANG
This document provides an overview of Hadoop, an open source distributed computing framework for processing large datasets. It discusses the motivation for Hadoop, including challenges with traditional approaches. It then describes how Hadoop provides partial failure support, fault tolerance, and data locality to efficiently process big data across clusters. The document outlines the core Hadoop concepts and architecture, including HDFS for reliable data storage, and MapReduce for parallel processing. It provides examples of how Hadoop works and how organizations use it at large scale.
HBase is a NoSQL database that stores data in HDFS in a distributed, scalable, reliable way for big data. It is column-oriented and optimized for random read/write access to big data in real-time. HBase is not a relational database and relies on HDFS. Common use cases include flexible schemas, high read/write rates, and real-time analytics. Apache Phoenix provides a SQL interface for HBase, allowing SQL queries, joins, and familiar constructs to manage data in HBase tables.
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
Cloudera Impala is a modern SQL query engine for Apache Hadoop that provides high performance for both analytical and transactional workloads. It runs directly within Hadoop clusters, reading common Hadoop file formats and communicating with Hadoop storage systems. Impala uses a C++ implementation and runtime code generation for high performance compared to other Hadoop SQL query engines like Hive that use Java and MapReduce.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem. Detailed explanation of how HDFS, MapReduce, and HBase work, followed by a discussion of how to tune HBase performance. Finally, a look at industry trends, including challenges faced and being solved by Bloomberg for using Hadoop for financial data.
CDH is a popular distribution of Apache Hadoop and related projects that delivers scalable storage and distributed computing through Apache-licensed open source software. It addresses challenges in storing and analyzing large datasets known as Big Data. Hadoop is a framework for distributed processing of large datasets across computer clusters using simple programming models. Its core components are HDFS for storage, MapReduce for processing, and YARN for resource management. The Hadoop ecosystem also includes tools like Kafka, Sqoop, Hive, Pig, Impala, HBase, Spark, Mahout, Solr, Kudu, and Sentry that provide functionality like messaging, data transfer, querying, machine learning, search, and authorization.
Similar to Architectural Evolution Starting from Hadoop
[SFScon'17] More than a decade with free open source softwareSpagoWorld
The presentation supported the speech by Gabriele Ruffatti - formerly Engineering Group's Open Source Competency Center Director - at SFScon ( https://www.sfscon.it/ ) in Bozen (Italy) on November 10th, 2017.
EclipseDay Milano 2017 - How to make Data Science appealing with open source ...SpagoWorld
The presentation supported the speech by Matteo Sartori and Michele Gabusi (Data Scientists, Engineering Group’s Big Data & Analytics Competency Center) at EclipseDay Milano 2017.
Webinar: SpagoBI 5 - Self-build your interactive cockpits, get instant insigh...SpagoWorld
This presentation supported the webinar delivered by Virginie Pasquon, SpagoBI Sales Engineer, in March 2015 (in English and French). It provides an overview of SpagoBI 5 focusing on the new self-service cockpits, letting you explore your data dynamically and get instant insights. www.spagobi.org
Webinar - SpagoBI 5: here comes the Social Network analysis SpagoWorld
This presentation supported the webinar delivered by Letizia Pernigotti, SpagoBI Consultant, in March 2015 (in English). It shows the latest feature for social network listening and monitoring provided by SpagoBI 5. www.spagobi.org
SpagoBI 5 Demo Day and Workshop : Business Applications and UsesSpagoWorld
These slides supported SpagoBI Labs' presentation of SpagoBI 5 ("Business Applications and Uses" session), taking place in New York, NY on January 26th, and in Herndon, VA on January 28th, 2015. Further details on the event: http://bit.ly/1IzatIX
SpagoBI 5 Demo Day and Workshop : Technology Applications and UsesSpagoWorld
These slides supported SpagoBI Labs' presentation of SpagoBI 5 ("Technology Applications and Uses" session), taking place in New York, NY on January 26th, and in Herndon, VA on January 28th, 2015. Further details on the event: http://bit.ly/1IzatIX
Engineering and OW2 Big Data Initiative: an open approach to the data-driven ...SpagoWorld
The presentation supported the speech by Stefano Scamuzzo (SpagoBI Ecosystem Manager) in the panel entitled “Big Data: towards a data-driven society” at the workshop “Embracing Potential of Big Data” (Pisa, Italy – December 12th, 2014). http://www.spagobi.org/
OW2Con’14 – OW2 Big Data initiative: leveraging the data-driven economy with ...SpagoWorld
At OW2Con’14 – the annual international community event of OW2 – that took place in Paris from 4th to 6th November 2014, Stefano Scamuzzo (SpagoBI Ecosystem Manager) presented the OW2 Big Data initiative (http://www.ow2.org/view/Big_Data/), of which Engineering Group and SpagoBI are leading members.
OW2Con’14 – OW2 Big Data initiative: leveraging the data-driven economy with ...SpagoWorld
The presentation supported the speech by Virginie Pasquon (SpagoBI Sales Engineer) at OW2Con’14 – the annual international community event of OW2, which took place in Paris from 4th to 6th November 2014. The presentation entitled “SpagoBI 5 – Towards new analytical horizons” provides an overview of the new analytical features and strengths of SpagoBI 5.
Simpda 2014 - A living story: measuring quality of developments in a large in...SpagoWorld
The presentation supported the speech by Gabriele Ruffatti (founder of the SpagoWorld initiative) at SIMPDA 2014 (Milan, Italy - November 19-21, 2014). The presentation focuses on the innovative approach named Productivity Intelligence, supported by Spago4Q - the open source analytics component of the SpagoBI suite for Quality and Performance Improvement - which allows companies and organizations to effectively monitor performance, improve quality practices and achieve higher capability levels. www.spagoworld.org
DrupalDay 2014 - Ecology of value and DRUPAL@Engineering: the experience of a...SpagoWorld
The presentation supported the speech given by Gabriele Ruffatti -Head of Engineering Group’s Open Source Competency Center- at DrupalDay, taking place in Milan (Italy) on 14th and 15th November 2014. www.spagoworld.org
SpagoBI 5 official presentation in ParisSpagoWorld
The presentation was shown during the official launch of SpagoBI 5 that took place in Paris (France) on November 4th, 2014, organized by Engineering Group's SpagoBI Labs. www.spagobi.org
Key topics are self-service cockpits, also on in-memory and mash-up technologies, what-if analytics with a brand new OLAP tool and data mining.
The agenda also focuses on recent use cases and the project roadmap. Interactive sessions will allow the audience to share knowledge, experiences and expectations.
Balanced Measurement Sets - Criteria for Improving Project Management PracticesSpagoWorld
The presentation supported the speech by Luigi Buglione (Engineering Group) at ISSRE 2014 - the 25th IEEE International Symposium on Software Reliability Engineering - taking place in Naples (Italy) from 3rd to 6th November 2014. It focuses on a new analytical model, designed in collaboration with the University of Milan, which enables the realization of a balanced measurement set thanks to Spago4Q (www.spago4q.org) - the open source analytics component of the SpagoBI suite for Quality and Performance Improvement.
Webinar - How SpagoBI 5 faces Big Data challenges to generate new business op...SpagoWorld
SpagoBI 5 takes a comprehensive approach to addressing Big Data challenges by connecting to diverse Big Data sources, extracting new information from Big Data, and visually representing Big Data in a meaningful way. It provides self-service analytics and ad hoc reporting capabilities. SpagoBI uses a semantic database to classify and categorize unstructured and heterogeneous data through domain ontologies. The system's approach allows users to define their own queries, create new datasets graphically, and refine their analyses.
Webinar - SpagoBI 5 and what-if analytics: is your business strategy effective?SpagoWorld
The document discusses SpagoBI 5 and its new what-if analytics engine. SpagoBI is a unique open source business intelligence solution that allows users to create what-if scenarios and analyze the impact on data in an interactive way. The presentation includes an agenda, information about SpagoBI and Engineering Group, an overview of the new what-if analysis tool and its features, a demo, roadmap for future enhancements, and instructions for staying updated.
Webinar - Self-build your cockpits and gain instant insights with SpagoBI 5SpagoWorld
This presentation supported the webinar delivered by SpagoBI Labs within SpagoBI Webinar Center in October 2014, in French and English. It provides an overview of the new features of SpagoBI 5, focusing on self-service cockpits, which you can build and tailor with a few clicks and simple drag & drop. Instant views allow you to enrich your enterprise data with external data sources.
Webinar - What's new in SpagoBI 5: advanced data analytics at your fingertipsSpagoWorld
This presentation supported the webinar delivered by SpagoBI Labs within SpagoBI Webinar Center in October 2014, in French, Spanish and English. It gives an overview of the new features of SpagoBI 5. It focuses on how you can easily self-build analysis on your data, to make effective business decisions.
The Business Intelligence SpagoBI suite and Big DataSpagoWorld
The presentation supported the speech by Monica Franceschini at the event dedicated to Business Intelligence, Big Data and Open Data for local tourism, taking place on February 28th, 2014 at the "Aldo Capitini-Vittorio Emanuele II-Arnolfo di Cambio" college in Perugia (Italy).
Open Source, a business model based on collaborationSpagoWorld
The presentation supported the speech by Gabriele Ruffatti (founder of the SpagoWorld initiative and President of OW2) at the event dedicated to Business Intelligence, Big Data and Open Data for local tourism, taking place on February 28th, 2014 at the "Aldo Capitini-Vittorio Emanuele II-Arnolfo di Cambio" college in Perugia (Italy).
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
10. Moreover…
• Adoption of a well-established solution
• Availability of support services
• Community, open source or … free version!
11. Hadoop storage: HDFS vs. HBase
HDFS:
• Large data sets
• Unstructured data
• Write-once-read-many access
• Append-only file system
• Hive HQL access
• Fault-tolerant
• Replication
HBase:
• High-speed writes and scans
• Many rows/columns
• Compaction
• Random read-writes
• Updates
• Rowkey access
• Data modeling
• NoSQL
• Untyped data
• Sparse schema
• High throughput
• Variable columns
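The split above, append-only files on one side versus random keyed updates on the other, can be sketched with two toy Python classes. These are illustrative stand-ins only, not real HDFS or HBase client APIs:

```python
class AppendOnlyFile:
    """Toy model of HDFS semantics: write once, append only, read many."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)  # existing data is never rewritten

    def read_all(self):
        return list(self._records)    # a full scan is the natural access path


class KeyValueTable:
    """Toy model of HBase semantics: random reads and writes by rowkey."""
    def __init__(self):
        self._rows = {}

    def put(self, rowkey, value):
        self._rows[rowkey] = value    # in-place update, addressed by key

    def get(self, rowkey):
        return self._rows.get(rowkey)


f = AppendOnlyFile()
f.append("event-1")
f.append("event-2")
print(f.read_all())              # ['event-1', 'event-2']

t = KeyValueTable()
t.put("user#42", {"visits": 1})
t.put("user#42", {"visits": 2})  # random update: fine here, impossible in the file model
print(t.get("user#42"))          # {'visits': 2}
```

The point of the toy is the shape of the API: the file offers only `append` and full reads, while the table offers point `put`/`get`, which is exactly the trade-off the two columns on the slide describe.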
14. Some HBase features:
• Just one index, the primary key
• Rowkey composed of other fields
• Big denormalized tables
• Rowkey-based horizontal partitioning
• Focus on rowkey design and table schema (data modeling)
• The ACCESS PATTERN must be known in advance!
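Because the access pattern drives the design, composite rowkeys are the usual answer: fields are concatenated so that the rows you want to read together sort together. A minimal Python sketch of one common layout, salt bucket plus entity id plus reversed timestamp, assuming a hypothetical "latest events per device" read pattern (the field names, bucket count and separators are made up for illustration):

```python
import hashlib

SALT_BUCKETS = 16       # spreads sequential writes across regions, avoiding hotspots
MAX_TS = 10**13         # arbitrary upper bound used to reverse timestamps

def make_rowkey(device_id: str, timestamp_ms: int) -> str:
    """Compose a rowkey from other fields, as the slide suggests:
    salt prefix + device id + reversed timestamp (newest rows sort first)."""
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    reversed_ts = MAX_TS - timestamp_ms
    return f"{salt:02d}|{device_id}|{reversed_ts:013d}"

k_old = make_rowkey("sensor-7", 1_000)
k_new = make_rowkey("sensor-7", 2_000)
print(k_new < k_old)  # True: the newer event sorts first for the same device
```

The same salt is derived deterministically from the device id, so all rows for one device stay in one bucket and a prefix scan still retrieves them in newest-first order.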
18. • Phoenix is fast: a full table scan of 100M rows usually executes in 20 seconds (narrow table on a medium-sized cluster). This comes down to a few milliseconds if the query contains a filter on key columns.
• Phoenix follows the philosophy of bringing the computation to the data by using:
• coprocessors to perform operations on the server side, thus minimizing client/server data transfer
• custom filters to prune data as close to the source as possible. In addition, Phoenix uses native HBase APIs to minimize any startup costs.
Query chunks: Phoenix chunks up your query using the region boundaries and runs the chunks in parallel on the client, using a configurable number of threads.
The aggregation is done in a coprocessor on the server side.
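The chunk-and-parallelize idea can be mimicked in plain Python: split the keyspace at region boundaries, count each chunk in a worker thread, and sum the partial results. This is only a rough sketch of what a Phoenix client does for a `select count(1)`; the rows and region boundaries below are invented, and in real Phoenix the per-chunk counting happens server-side in coprocessors rather than on the client:

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Sorted rowkeys standing in for a table; region boundaries are assumed.
rows = sorted(f"row{i:05d}" for i in range(1000))
region_boundaries = ["row00000", "row00250", "row00500", "row00750", "row99999"]

def count_chunk(start, stop):
    # In Phoenix this filtering/counting runs in a server-side coprocessor;
    # here we just count the rows falling inside [start, stop).
    return bisect_left(rows, stop) - bisect_left(rows, start)

# One chunk per region: pair each boundary with the next one.
chunks = list(zip(region_boundaries, region_boundaries[1:]))
with ThreadPoolExecutor(max_workers=4) as pool:  # configurable thread count
    partial_counts = list(pool.map(lambda c: count_chunk(*c), chunks))

total = sum(partial_counts)
print(partial_counts, total)  # [250, 250, 250, 250] 1000
```

Deriving chunks from region boundaries means each worker talks to exactly one region server, which is why the parallelism scales with the number of regions rather than with an arbitrary split factor.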
19. • OLTP
• Analytic queries
• Hbase specific
• A lightweight solution
• Who else is going to use it?
20. • Query engine + metadata store + JDBC driver
• Database over HDFS (for bulk loads and full-table scan queries)
• Uses HBase APIs (does not access HFiles directly)
• …what about performance?…
Query: select count(1) from table, over 1M and 5M rows. Data is 3 narrow columns. Number of Region Servers: 1 (virtual machine, HBase heap: 2 GB, processor: 2 cores @ 3.3 GHz Xeon)
21. • Query engine + metadata store + JDBC driver
• DWH over HDFS
• Runs MapReduce jobs to query HBase
• StorageHandler to read HBase
• …what about performance?…
Query: select count(1) from table, over 10M and 100M rows. Data is 5 narrow columns. Number of Region Servers: 4 (HBase heap: 10 GB, processor: 6 cores @ 3.3 GHz Xeon)
22. • Cassandra + Spark as a lightweight solution (replacing HBase + Spark)
• SQL-like language (CQL) + secondary indexes
• …what about the other Hadoop tools?...
23. • Converged data platform: batch + NoSQL + streaming
• MapR-FS: great for throughput and files of every size + individual updates
• Apache Drill as SQL layer on MapR-FS
• …proprietary solution…
24. • Developed by Cloudera and open source (integrated with the Hadoop ecosystem)
• Low-latency random access
• Super-fast columnar storage
• Designed for next-generation hardware (storage based on the I/O of solid-state drives + an experimental cache implementation)
• …beta version…
"With Kudu, Cloudera promises to solve Hadoop's infamous storage problem" (InfoWorld, Sep 28, 2015)
25. Hadoop storage, revisited: HDFS vs. HBase
HDFS:
• Unstructured data
• Deep storage
• Fixed column schema
• SQL + scan use cases
HBase:
• Any-type column schema
• Gets/puts/micro scans
In between, the gap Kudu targets: a highly scalable in-memory database for MPP workloads. Fast writes, fast updates, fast reads, fast everything; structured data; SQL + scan use cases.
26. Conclusions
• One size doesn't fit all the different requirements
• The choice between different open source solutions is driven by the context
• Technology evolves
• So what?
• REQUIREMENTS
• NO LOCK-IN
• PEER-REVIEWS