Entity Relationships in a Document Database at CouchConf Boston - Bradley Holt
Unlike relational databases, document databases like CouchDB and Couchbase do not directly support entity relationships. This talk will explore patterns of modeling one-to-many and many-to-many entity relationships in a document database. These patterns include using an embedded JSON array, relating documents using identifiers, using a list of keys, and using relationship documents.
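The four patterns named above can be sketched with plain Python dicts standing in for JSON documents. All document shapes and field names here are illustrative, not taken from the talk:

```python
# Sketch of the four relationship patterns, using dicts as stand-ins for
# JSON documents; ids and field names are invented for illustration.

# 1. Embedded JSON array: comments live inside the post document.
post_embedded = {
    "_id": "post:1",
    "title": "Hello CouchDB",
    "comments": [{"author": "alice", "text": "Nice!"}],
}

# 2. Related documents via identifiers: each comment points at its post.
comment = {"_id": "comment:9", "post_id": "post:1", "author": "bob", "text": "+1"}

# 3. List of keys: the post keeps only the ids of its comments.
post_with_keys = {"_id": "post:1", "comment_ids": ["comment:9", "comment:12"]}

# 4. Relationship documents (useful for many-to-many, e.g. authors <-> books).
relationship = {"_id": "rel:1", "type": "authorship",
                "author_id": "author:1", "book_id": "book:7"}

def comments_for(post, all_comments):
    """Resolve pattern 2: gather the comments whose post_id matches the post."""
    return [c for c in all_comments if c["post_id"] == post["_id"]]

print(comments_for(post_embedded, [comment]))  # -> the single matching comment
```

In a real document database, pattern 2 and 3 lookups would be served by a view or secondary index rather than a linear scan.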
This is a presentation on CouchDB that I gave at the New York PHP User Group. I talked about the basics of CouchDB, its JSON documents, its RESTful API, writing and querying MapReduce views, using CouchDB from within PHP, and scaling.
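For intuition about the MapReduce views mentioned above, here is a pure-Python simulation of the map/group/reduce flow. Real CouchDB views are JavaScript functions stored in design documents; this sketch only mirrors the shape, and the sample documents are invented:

```python
# Simulate a CouchDB-style view that counts comments per post.
from collections import defaultdict

docs = [
    {"type": "comment", "post_id": "post:1"},
    {"type": "comment", "post_id": "post:1"},
    {"type": "comment", "post_id": "post:2"},
]

def map_fn(doc):
    # Like a view's map function: emit (key, value) pairs for matching docs.
    if doc.get("type") == "comment":
        yield doc["post_id"], 1

def reduce_fn(values):
    # Equivalent of CouchDB's built-in _sum reduce.
    return sum(values)

groups = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        groups[key].append(value)

view = {key: reduce_fn(vals) for key, vals in groups.items()}
print(view)  # -> {'post:1': 2, 'post:2': 1}
```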
[PASS Summit 2016] Azure DocumentDB: A Deep Dive into Advanced Features - Andrew Liu
Let's talk about how you can get the most out of Azure DocumentDB. In this session we will dive deep into the mechanics of DocumentDB and explain the various levers available to tune performance and scale. From partitioned collections to global databases to advanced indexing and query features - this session will equip you with the best practices and nuggets of information that will become invaluable tools in your toolbox for building blazingly fast large-scale applications.
1 December 2015
Groupe Azure
Topic: Introduction to DocumentDB
Speaker: Vincent-Philippe Lauzon, Microsoft
Azure DocumentDB is a NoSQL database. In this introduction to DocumentDB, you will see:
• What a NoSQL database is
• How DocumentDB compares to other Azure databases
• How DocumentDB compares to other NoSQL databases
• How to create and manage a DocumentDB database
• How to use it (tools + C#)
• Security
• Performance / Capacity
Vincent-Philippe Lauzon is a Microsoft Azure Solution Architect & Machine Learning Senior Consultant at CGI. You can read his blog at http://vincentlauzon.com and follow him on Twitter at https://twitter.com/vplauzon
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas... - NoSQLmatters
When deploying your service to Microsoft Azure, you have a number of options in terms of NoSQL: you can install databases on Linux or Windows virtual machines yourself or via the marketplace, or you can use open source databases available as a service, like HBase, or proprietary and managed databases like Document DB. After showing these options, we'll look at Document DB in more detail. This is a NoSQL database as a service that stores JSON.
Extensible RESTful Applications with Apache TinkerPop - Varun Ganesh
Presented at Graph Day SF 2018.
You are into data analytics. You come across a source of data and you realise that it is an intuitive case for a Knowledge Graph and that there is much value to be gained by incorporating it into one. How do you take this from zero to product while ensuring that it is well-tested, extensible, scalable and plays nicely with other components and services?
Slack, with its various interactions among its users, is a prime candidate for this. Join us as we take you through our journey of conceptualizing Slack user data as a knowledge graph, evaluating different frameworks, incorporating business logic using TinkerPop with an extensible DSL, and exposing it all through a familiar RESTful interface that allows us to effectively handle an ever-growing and dynamic graph.
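To make the idea of a user-interaction knowledge graph concrete, here is a toy in-memory graph with a Gremlin-flavoured traversal helper. This is not the TinkerPop API; the class, edge labels, and data are all invented for illustration:

```python
# Toy graph: (source, edge_label, destination) triples; data is hypothetical.
edges = [
    ("alice", "member_of", "#general"),
    ("bob", "member_of", "#general"),
    ("alice", "messaged", "bob"),
]

class Traversal:
    """A minimal chainable traversal, loosely echoing Gremlin's out() step."""

    def __init__(self, nodes, edges):
        self.nodes, self.edges = nodes, edges

    def out(self, label):
        # Follow outgoing edges with the given label from the current nodes.
        nxt = {dst for src, lbl, dst in self.edges
               if lbl == label and src in self.nodes}
        return Traversal(nxt, self.edges)

    def values(self):
        return sorted(self.nodes)

g = Traversal({"alice"}, edges)
print(g.out("member_of").values())  # -> ['#general']
```

A domain-specific DSL like the one described in the talk would wrap such primitive steps in named, business-level traversals (e.g. `channels_of(user)`).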
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse... - OpenSource Connections
Recently Elasticsearch has introduced a number of ways to improve the search relevance of your documents based on numeric features. In this talk I will present the newly introduced field types "rank_feature", "rank_features", "dense_vector", and "sparse_vector" and discuss in what situations and how they can be used to boost the scores of your documents. I will also talk about the inner workings of queries based on these fields, and related performance considerations.
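For intuition about how such fields boost scores: the default scoring of a rank_feature-style query is a saturation function, roughly score = value / (value + pivot), scaled by the query boost. A small sketch (the function and numbers here are illustrative, not the exact Lucene implementation):

```python
def saturation(value: float, pivot: float) -> float:
    """Saturation scoring: monotonically increasing, bounded by 1,
    and equal to 0.5 when value == pivot."""
    return value / (value + pivot)

# A document whose feature equals the pivot gets half the maximum score:
print(saturation(10, 10))  # -> 0.5
# Diminishing returns: doubling the feature does not double the score.
print(saturation(20, 10))
```

This shape is what makes numeric features like pagerank or url_length safe to mix into relevance: large raw values cannot dominate the overall score.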
Introducing Azure DocumentDB - NoSQL, No Problem - Andrew Liu
Application developers support unprecedented rates of change – functionality must rapidly evolve to meet changing customer needs and to respond to competitive pressures while user populations can grow dramatically and unpredictably. To address these realities, developers are selecting document-oriented databases for schema flexibility, scalability and high performance data storage.
In this session, we will get hands on with Azure’s NoSQL document database service. Azure DocumentDB offers full indexing of JSON documents, SQL query capabilities and multi-document transactions. Learn how to get started with Azure DocumentDB and hear about some of the recent improvements to the service.
How Solr Search Works - A tech talk at the Atlogys Delhi office by our Senior Technologist Rajat Jain. The lecture takes a deep dive into Solr - what it is, how it works, what it does, and its inbuilt architecture. A wonderful technical session with many live examples, a sneak peek into Solr code and config files, and a live demo. Part of the Atlogys Academy Series.
The Briefcase Cluster – Enabling Big Data Everywhere - MapR Technologies
The briefcase cluster is a mobile MapR cluster to collect big data in remote places like offshore platforms or airplanes. It can also serve as a private cluster for individuals looking to bring data from different IoT devices together in one privately controlled cluster.
Search Engine Training Institute in Ambala! - Batra Computer Centre - jatin batra
Batra Computer Centre is an ISO 9001:2008 certified training centre in Ambala.
We provide the best search engine training in Ambala. BATRA COMPUTER CENTRE provides training in C, C++, SEO, Web Designing, Web Development, and many other courses.
An over-ambitious introduction to Spark programming, testing and deployment. This slide deck tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to cause the transparent background color not to be rendered properly. This has been fixed in a recent upload.
MyGym is a mobile app that lets gym members reserve equipment for their workout before they get to the gym. Gym managers can manage gym flow and easily maintain gym equipment. Project for the Business of UX course to design a client solution.
Team: Denise Borges, Monica Caraway, Susan Oldham, Suryaprakash Vijayaraghavan
This project considers all the details of a small-to-medium scale fitness center. The fitness center for which the database system is designed is a multi-branch fitness center that houses exercise equipment for the purpose of physical exercise. It also includes facilities like cardio workout sessions, group exercise classes (like aerobics, yoga, etc.), personal training, and also houses sauna and steam shower facilities.
The 20th annual Enterprise Data World (EDW) Conference took place in San Diego, April 17-21. It is recognized as the most comprehensive educational conference on data management in the world.
Joe Caserta was a featured presenter. His session "Evolving from the Data Warehouse to Big Data Analytics - the Emerging Role of the Data Lake" highlighted the challenges and steps needed to become a data-driven organization.
Joe also participated in two panel discussions during the show:
• "Data Lake or Data Warehouse?"
• "Big Data Investments Have Been Made, But What's Next?"
For more information on Caserta Concepts, visit our website at http://casertaconcepts.com/.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... - Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running on a separate machine/instance. With Spark Cluster with Elasticsearch Inside, it is possible to run an embedded instance of Elasticsearch on the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. The motivation is that once Elasticsearch is running on Spark, it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. This in turn enables indexing of datasets that are processed as part of data pipelines running on Spark. Dataset search and data management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their data lake and make it searchable.
MongoDB Days Germany: Data Processing with MongoDB - MongoDB
Presented by Marc Schwering, Senior Solutions Architect, MongoDB
Modern architectures are moving away from "one size fits all" solutions. The best tools need to be put to the job, and given the large number of options today, chances are that you'll end up using MongoDB for your operational workload, as well as data processing systems like Apache Flink or Spark for your high-speed data processing needs. When documents or data structures are modeled, there are some key aspects that need to be attended to. This takes into consideration the distribution of data nodes, streaming capabilities, performance, aggregation and queryability options, and how we can integrate the different data processing software that can benefit from subtle but substantial model changes. This session will cover how you can enhance your architecture using data processing technologies such as Apache Flink and Spark. It will take the audience through the evolution of an app from simple to complex, along with its architectural requirements. We'll look into similarities and differences of the available technologies, and you will walk away with an understanding of how to use MongoDB to fulfill more advanced tasks such as personalization through clustering algorithms.
The Fine Art of Schema Design in MongoDB: Dos and Don'ts - Matias Cascallares
Schema design in MongoDB can be an art. Different trade-offs should be considered when designing how to store your data. In this presentation we are going to cover some common scenarios, recommended practices, and pitfalls to avoid, based on previous experiences.
Building a complete social networking platform presents many challenges at scale. Socialite is a reference architecture and open source Java implementation of a scalable social feed service built on DropWizard and MongoDB. We'll provide an architectural overview of the platform, explaining how you can store an infinite timeline of data while optimizing indexing and sharding configuration for access to the most recent window of data. We'll also dive into the details of storing a social user graph in MongoDB.
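The "infinite timeline with an optimized recent window" idea can be sketched as fan-out-on-write: when a user posts, the post id is pushed onto each follower's timeline, and reads only touch a bounded recent window. This is a toy illustration, not Socialite's actual implementation; all names and sizes are invented:

```python
from collections import defaultdict, deque

WINDOW = 3  # keep only the most recent window of posts per timeline
followers = {"alice": ["bob", "carol"]}  # hypothetical follower graph
timelines = defaultdict(lambda: deque(maxlen=WINDOW))

def post(author, post_id):
    """Fan out a new post to every follower's timeline, newest first.
    deque(maxlen=...) silently drops the oldest entry past the window."""
    for follower in followers.get(author, []):
        timelines[follower].appendleft(post_id)

for i in range(5):
    post("alice", f"p{i}")

print(list(timelines["bob"]))  # -> ['p4', 'p3', 'p2']
```

In the real service, older posts would still be reachable from a separate archive store; only the hot window lives in the fanned-out timelines.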
Analyzing Semi-Structured Data At Volume In The CloudRobert Dempsey
Presentation from Snowflake Computing at the November 2015 Data Wranglers DC meetup.
Cloud, mobile, and web applications are producing semi-structured data at an unprecedented rate. IT professionals continue to struggle to capture, transform, and analyze these complex data structures mixed with traditional relational-style datasets using conventional MPP and/or Hadoop infrastructures. Public cloud infrastructures such as Amazon and Azure provide almost unlimited resources and scalability to handle both structured and semi-structured data (XML, JSON, AVRO) at petabyte scale. These new capabilities, coupled with traditional data management access methods such as SQL, allow organizations and businesses new opportunities to leverage analytics at an unprecedented scale while greatly simplifying data pipeline architectures and providing an alternative to the "data lake".
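The transformation step described above often amounts to flattening nested JSON into the dotted column paths a SQL engine can query. A minimal sketch (purely illustrative; the sample event is invented):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} pairs, the shape that
    SQL-over-JSON engines expose as queryable paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

event = {"user": {"id": 7, "geo": {"country": "US"}}, "action": "click"}
print(flatten(event))
# -> {'user.id': 7, 'user.geo.country': 'US', 'action': 'click'}
```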
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini... - Michael Rys
From theory to implementation - follow the steps of implementing an end-to-end analytics solution illustrated with some best practices and examples in Azure Data Lake.
During this full training day we will share the architecture patterns, tooling, learnings and tips and tricks for building such services on Azure Data Lake. We take you through some anti-patterns and best practices on data loading and organization, give you hands-on time and the ability to develop some of your own U-SQL scripts to process your data and discuss the pros and cons of files versus tables.
These were the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.
The web has changed! Users spend more time on mobile than on desktops, and they expect an amazing user experience on both platforms. APIs are the heart of the new web, serving as the central point of access to data, encapsulating logic, and providing the same data and features for desktops and mobiles.
In this talk, I will show you how, in only 45 minutes, we can create a full REST API, with documentation and an admin application built with React.
JSON_TO_HIVE_SCHEMA_GENERATOR is a tool that effortlessly converts your JSON data to a Hive schema, which can then be used with Hive to carry out processing of the data. It is designed to automatically generate a Hive schema from JSON data. It takes into account various issues (multiple JSON objects per file, NULL values, the absence of certain fields, etc.) and can parse millions of records to obtain a schema definition for the data, including nested structures.
Follow: https://github.com/jainpayal12/Json_To_HiveSchema_Generator.git
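The core of such a tool is mapping JSON values to Hive column types, recursing into nested structures. Here is a deliberately simplified sketch of that inference step (not the actual generator's code; it skips NULL reconciliation and merging across records):

```python
def hive_type(value):
    """Map a JSON value to a Hive type string; nested dicts become STRUCTs
    and lists become ARRAYs. bool is checked before int because Python's
    bool is a subclass of int."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, list):
        inner = hive_type(value[0]) if value else "STRING"
        return f"ARRAY<{inner}>"
    if isinstance(value, dict):
        fields = ", ".join(f"{k}: {hive_type(v)}" for k, v in value.items())
        return f"STRUCT<{fields}>"
    return "STRING"

record = {"id": 1, "tags": ["a", "b"], "meta": {"score": 0.5}}
cols = {k: hive_type(v) for k, v in record.items()}
print(cols)
# -> {'id': 'BIGINT', 'tags': 'ARRAY<STRING>', 'meta': 'STRUCT<score: DOUBLE>'}
```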
Application development with Oracle NoSQL Database 3.0Anuj Sahni
Oracle announced Oracle NoSQL Database 3.0 on April 2, 2014. This release offers increased security, simplified data modeling, secondary indices, and multi-datacenter performance enhancements.
For audio/video presentation visit: http://bit.ly/1qLEZW9
Combine Spring Data Neo4j and Spring Boot to quickl... - Neo4j
Speakers: Michael Hunger (Neo Technology) and Josh Long (Pivotal)
Spring Data Neo4j 3.0 is here and it supports Neo4j 2.0. Neo4j is a tiny graph database with a big punch. Graph databases are eminently suited to asking interesting questions and doing analysis. Want to load the Facebook friend graph? Build a recommendation engine? Neo4j's just the ticket. Join Spring Data Neo4j lead Michael Hunger (@mesirii) and Spring Developer Advocate Josh Long (@starbuxman) for a look at how to build smart, graph-driven applications with Spring Data Neo4j and Spring Boot.
Applied Machine Learning using H2O, Python and R Workshop - Avkash Chauhan
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Prerequisites: basic knowledge of R/Python and general ML concepts
Note: This is bring-your-own-laptop workshop. Make sure you bring your laptop in order to be able to participate in the workshop
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding linear regression with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
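The k-means item in the agenda above can be sketched in pure Python for intuition (the workshop itself uses H2O; this toy loop works on 1-D points with k=2 and invented data):

```python
def kmeans(points, centers, iters=10):
    """Plain k-means: assign each point to its nearest center, then move
    each center to the mean of its assigned points; repeat."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Empty clusters keep their old center.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

# Two obvious 1-D clusters around 1.0 and 9.0:
print(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 5.0]))  # -> [1.0, 9.0]
```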
How Data-Driven Approaches are Changing Your Data Management Strategies
Introducing data-driven strategies into your business model alters the way your organization manages and provides information to your customers, partners and employees. Gone are the days of “waterfall” implementation strategies from relational data to applications within a data center. Now, data-driven business models require agile implementation of applications based on information from all across an organization–on-premises, cloud, and mobile–and includes information from outside corporate walls from partners, third-party vendors, and customers. Data management strategies need to be ready to meet these challenges or your new and disruptive business models will fail at the most critical time: when your customers want to access it.
ML Workshop 2: Machine Learning Model Comparison & Evaluation - MapR Technologies
How Rendezvous Architecture Improves Evaluation in the Real World
In this edition of our machine learning logistics webinar series, we build on the key requirements for effective management of machine learning logistics presented in the Overview webinar and in the Part 1 workshop. Here we focus on model-to-model comparison & evaluation, use of decoy models, and more. Listen here: http://info.mapr.com/machine-learning-workshop2.html?_ga=2.35695522.324200644.1511891424-416597139.1465233415
Self-Service Data Science for Leveraging ML & AI on All of Your Data - MapR Technologies
MapR has launched the MapR Data Science Refinery which leverages a scalable data science notebook with native platform access, superior out-of-the-box security, and access to global event streaming and a multi-model NoSQL database.
Enabling Real-Time Business with Change Data Capture - MapR Technologies
Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ... - MapR Technologies
Big data technologies are being applied to a wide variety of use cases. We will review tangible examples of machine learning, discuss an autonomous driving project and illustrate the role of MapR in next generation initiatives. More: http://info.mapr.com/WB_Machine-Learning-for-Chickens_Global_DG_17.11.02_RegistrationPage.html
ML Workshop 1: A New Architecture for Machine Learning Logistics - MapR Technologies
Having heard the high-level rationale for the rendezvous architecture in the introduction to this series, we will now dig in deeper to talk about how and why the pieces fit together. In terms of components, we will cover why streams work, why they need to be persistent, performant and pervasive in a microservices design and how they provide isolation between components. From there, we will talk about some of the details of the implementation of a rendezvous architecture including discussion of when the architecture is applicable, key components of message content and how failures and upgrades are handled. We will touch on the monitoring requirements for a rendezvous system but will save the analysis of the recorded data for later. Listen to the webinar on demand: https://mapr.com/resources/webinars/machine-learning-workshop-1/
Machine Learning Success: The Key to Easier Model Management - MapR Technologies
Join Ellen Friedman, co-author (with Ted Dunning) of a new short O’Reilly book Machine Learning Logistics: Model Management in the Real World, to look at what you can do to have effective model management, including the role of stream-first architecture, containers, a microservices approach and a DataOps style of work. Ellen will provide a basic explanation of a new architecture that not only leverages stream transport but also makes use of canary models and decoy models for accurate model evaluation and for efficient and rapid deployment of new models in production.
Data Warehouse Modernization: Accelerating Time-To-Action - MapR Technologies
Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
Live Tutorial – Streaming Real-Time Events Using Apache APIs - MapR Technologies
For this talk we will explore the power of streaming real time events in the context of the IoT and smart cities.
http://info.mapr.com/WB_Streaming-Real-Time-Events_Global_DG_17.08.02_RegistrationPage.html
Bringing Structure, Scalability, and Services to Cloud-Scale Storage - MapR Technologies
Deploying storage with a forklift is so 1990s, right? Today’s applications and infrastructure demand systems and services that scale. Customers require performance and capacity to fit the use case and workloads, not the other way around. Architects need multi-temperature, multi-location, highly available, and compliance friendly platforms that grow with the generational shift in data growth and utility.
Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a service. Though originally used within the telecommunications industry, it has become common practice for banks, ISPs, insurance firms, and other verticals. More: http://info.mapr.com/WB_PredictingChurn_Global_DG_17.06.15_RegistrationPage.html
The prediction process is data-driven and often uses advanced machine learning techniques. In this webinar, we'll look at customer data, do some preliminary analysis, and generate churn prediction models – all with Spark machine learning (ML) and a Zeppelin notebook.
Spark’s ML library goal is to make machine learning scalable and easy. Zeppelin with Spark provides a web-based notebook that enables interactive machine learning and visualization.
In this tutorial, we'll do the following:
Review classification and decision trees
Use Spark DataFrames with Spark ML pipelines
Predict customer churn with Apache Spark ML decision trees
Use Zeppelin to run Spark commands and visualize the results
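The "review classification and decision trees" step above hinges on a split criterion; Gini impurity is the one Spark ML decision trees use by default. A tiny sketch with hypothetical churn labels (not the webinar's actual dataset):

```python
def gini(labels):
    """Gini impurity of a node: 0.0 for a pure node,
    0.5 for a 50/50 binary split."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

mixed = ["churn", "stay", "churn", "stay"]
pure = ["stay", "stay", "stay"]
print(gini(mixed), gini(pure))  # -> 0.5 0.0
```

A decision tree grows by choosing, at each node, the feature split that most reduces this impurity in the child nodes.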
An Introduction to the MapR Converged Data Platform - MapR Technologies
Listen to the webinar on-demand: http://info.mapr.com/WB_Partner_CDP_Intro_EMEA_DG_17.05.31_RegistrationPage.html
In this 90-minute webinar, we discuss:
- The MapR Converged Data Platform and its components
- Use cases for the Converged Data Platform
- MapR Converged Partner Program
- How to get started with MapR
- Becoming a partner
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon... - MapR Technologies
IT budgets are shrinking, and the move to next-generation technologies is upon us. The cloud is an option for nearly every company, but just because it is an option doesn’t mean it is always the right solution for every problem.
Most cloud providers would prefer that every customer be tightly coupled with their proprietary services and APIs to create lock-in with that cloud provider. The savvy customer will leverage the cloud as infrastructure and stay loosely bound to a cloud provider. This creates an opportunity for the customer to execute a multicloud strategy or even a hybrid on-premises and cloud solution.
Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations. Along the way, Jim discusses security, backups, event streaming, databases, replication, and snapshots across a variety of use cases that run most businesses today.
Is your organization at the analytics crossroads? Have you made strides collecting and sharing massive amounts of data from electronic health records, insurance claims, and health information exchanges but found these efforts made little impact on efficiency, patient outcomes, or costs?
Changes in how business is done combined with multiple technology drivers make geo-distributed data increasingly important for enterprises. These changes are causing serious disruption across a wide range of industries, including healthcare, manufacturing, automotive, telecommunications, and entertainment. Technical challenges arise with these disruptions, but the good news is there are now innovative solutions to address these problems. http://info.mapr.com/WB_Geo-distributed-Big-Data-and-Analytics_Global_DG_17.05.16_RegistrationPage.html
MapR announced a few new releases in 2017, and we want to go over those exciting new products and features that are available now. We’d like to invite our customers and partners to this webinar in which members of the MapR product team will share details about the latest updates.
3 Benefits of Multi-Temperature Data Management for Data Analytics - MapR Technologies
SAP® HANA and SAP® IQ are popular platforms for various analytical and transactional use cases. If you’re an SAP customer, you’ve experienced the benefits of deploying these solutions. However, as data volumes grow, you’re likely asking yourself: How do I scale storage to support these applications? How can I have one platform for various applications and use cases?
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments - MapR Technologies
SAP HANA is an increasingly popular platform for various analytical and transactional use cases with its in-memory architecture. If you’re an SAP customer you’ve experienced the benefits.
However, the underlying storage for SAP HANA is painfully expensive. This slows down your ability to grow your SAP HANA footprint and serve up more applications.
You’re not the only one still loading your data into data warehouses and building marts or cubes out of it. But today’s data requires a much more accessible environment that delivers real-time results. Prepare for this transformation because your data platform and storage choices are about to undergo a re-platforming that happens once in 30 years.
With the MapR Converged Data Platform (CDP) and Cisco Unified Compute System (UCS), you can optimize today’s infrastructure and grow to take advantage of what’s next. Uncover the range of possibilities from re-platforming by intimately understanding your options for density, performance, functionality and more.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis' slides from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
4. Hadoop workloads and APIs
Use case → API:
• ETL and aggregation (batch): MapReduce, Hive, Pig, Cascading
• Predictive modeling and analytics (batch): Mahout, MLLib, Spark
• Interactive SQL (data exploration, ad-hoc queries & reporting): Drill, Shark, Impala, Hive on Tez, Presto
• Search: Solr, Elasticsearch
• Operational (user-facing applications, point queries): HBase API, Phoenix
5. Interactive SQL and Hadoop
• Opens up Hadoop data to a broader audience
– Existing SQL skill sets
– Broad ecosystem of tools
• New and improved BI/analytics use cases
– Analysis on more raw data, new types of data and real-time data
• Cost savings
(Diagram label: Enterprise users.)
6. Data landscape is changing
New types of applications
• Social, mobile, Web, “Internet of Things”, Cloud…
• Iterative/agile in nature
• More users, more data
New data models & data types
• Flexible (schema-less) data
• Rapidly changing
• Semi-structured/nested data
Example (JSON):
{
  "data": [
    {
      "id": "X999_Y999",
      "from": {
        "name": "Tom Brady",
        "id": "X12"
      },
      "message": "Looking forward to 2014!",
      "actions": [
        {
          "name": "Comment",
          "link": "http://www.facebook.com/X99/posts/Y999"
        },
        {
          "name": "Like",
          "link": "http://www.facebook.com/X99/posts/Y999"
        }
      ],
      "type": "status",
      "created_time": "2013-08-02T21:27:44+0000",
      "updated_time": "2013-08-02T21:27:44+0000"
    }
  ]
}
7. Traditional datasets
• Come from transactional applications
• Stored for historical purposes and/or for large-scale ETL/analytics
• Well-defined schemas
• Managed centrally by DBAs
• No frequent changes to schema
• Flat datasets
New datasets
• Come from new applications (e.g. social feeds, clickstream, logs, sensor data)
• Enable new use cases such as customer satisfaction, product/service optimization
• Flexible data models, managed within applications
• Schemas evolving rapidly
• Semi-structured/nested data
Hadoop is evolving as the central hub for analysis: it provides a cost-effective, flexible way to store and process data at scale.
8. Existing SQL approaches will not always work for big data needs
• New data models/types don’t map well to relational models
– Many data sources do not have rigid schemas (HBase, Mongo etc.)
• Each record has a separate schema
• Sparse and wide rows
– Flattening nested data is error-prone and often impossible
• Think about repeated and optional fields at every level…
• A single HBase value could be a JSON document (compound nested type)
• Centralized schemas are hard to manage for big data
– Rapidly evolving data source schemas
– Lots of new data sources
– Third-party data
– Unknown questions
(Diagram: enterprise users raise new questions/requirements; schema changes or new data sources go to DBA/DWH teams, who model the data and move it into traditional systems before big data can be analyzed.)
9. Apache Drill: open source SQL on Hadoop for agility with big data exploration
• FLEXIBLE SCHEMA MANAGEMENT: analyze data with or without centralized schemas
• ANALYTICS ON NOSQL DATA: analyze semi-structured & nested data with no modeling/ETL
• PLUG AND PLAY WITH EXISTING TOOLS: analyze data using familiar BI/analytics and SQL-based tools
11. Drill: flexible schema management
JSON:
{
  "ID": 1,
  "NAME": "Fairmont San Francisco",
  "DESCRIPTION": "Historic grandeur…",
  "AVG_REVIEWER_SCORE": "4.3",
  "AMENITY": [
    { "TYPE": "gym", "DESCRIPTION": "fitness center" },
    { "TYPE": "wifi", "DESCRIPTION": "free wifi" }
  ],
  "RATE_TYPE": "nightly",
  "PRICE": "$199",
  "REVIEWS": ["review_1", "review_2"],
  "ATTRACTIONS": "Chinatown"
}
Equivalent relational tables:
HotelID | AmenityID
1 | 1
1 | 2
ID | Type | Description
1 | Gym | Fitness center
2 | Wifi | Free wifi
Drill doesn’t require any schema definitions to query data, making it faster for users to get insights from data. Drill leverages schema definitions if they exist.
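The flattening that Drill lets you skip can be seen in miniature below — a rough Python sketch (illustrative only, not Drill code) that turns the nested AMENITY array from the hotel document above into the two relational tables shown on the slide:

```python
import json

# Hypothetical hotel document mirroring the JSON on this slide.
doc = json.loads("""
{
  "ID": 1,
  "NAME": "Fairmont San Francisco",
  "AMENITY": [
    {"TYPE": "gym",  "DESCRIPTION": "fitness center"},
    {"TYPE": "wifi", "DESCRIPTION": "free wifi"}
  ]
}
""")

# Flatten the nested AMENITY array into two relational tables,
# as a schema-first system would require before querying.
amenities = []       # rows of (AmenityID, Type, Description)
hotel_amenity = []   # join table rows of (HotelID, AmenityID)
for i, a in enumerate(doc["AMENITY"], start=1):
    amenities.append((i, a["TYPE"], a["DESCRIPTION"]))
    hotel_amenity.append((doc["ID"], i))

print(hotel_amenity)  # [(1, 1), (1, 2)]
print(amenities)      # [(1, 'gym', 'fitness center'), (2, 'wifi', 'free wifi')]
```

Every new nested field would require revisiting this modeling step — which is exactly the upfront work Drill avoids by querying the JSON directly.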
12. Key features
• Dynamic/schema-less queries
• Nested data
• Apache Hive integration
• ANSI SQL/BI tool integration
13. Querying files
• Direct queries on a local or a distributed file system (HDFS, S3 etc.)
• Configure one or more directories in the file system as “workspaces”
– Think of these as similar to schemas in databases
– The default workspace points to the “root” location
• Specify a single file or a directory as the ‘table’ within the query
• Specify the schema in the query or let Drill discover it
• Example:
SELECT * FROM dfs.users.`/home/mapr/sample-data/profiles.json`
– dfs: the file system as a data source
– users: workspace (corresponds to a directory)
– /home/mapr/sample-data/profiles.json: table
14. More examples
• Query on a single file
SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan/part0001.txt`
• Query on a directory
SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan` WHERE errorLevel = 1;
• Joins on files
SELECT c.c_custkey, sum(o.o_totalprice)
FROM
  dfs.`/home/mapr/tpch/customer.parquet` c
  JOIN
  dfs.`/home/mapr/tpch/orders.parquet` o
  ON c.c_custkey = o.o_custkey
GROUP BY c.c_custkey
LIMIT 10
15. Querying HBase
• Direct queries on HBase tables
– SELECT row_key, cf1.month, cf1.year FROM hbase.table1;
– SELECT CONVERT_FROM(row_key, UTF-8) AS HotelName FROM HotelData
• No need to define a parallel/overlay schema in Hive
• Encode and decode data from HBase using convert functions
– Convert_To and Convert_From
16. Nested data
• Nested data as a first-class entity: extensions to SQL for nested data types, similar to BigQuery
• No upfront flattening/modeling required
• Generic architecture for a broad variety of nested data types (e.g. JSON, BSON, XML, AVRO, Protocol Buffers)
• Performance through a ground-up design for nested data
• Example:
SELECT
  c.name, c.address, REPEATED_COUNT(c.children)
FROM (
  SELECT
    CONVERT_FROM(cf1.user-json-blob, JSON) AS c
  FROM
    hbase.table1
)
17. Apache Hive integration
• Plug-and-play integration in existing Hive deployments
• Use Drill to query data in Hive tables/views
• Support for working with more than one Hive metastore
• Support for all Hive file formats
• Ability to use Hive UDFs as part of Drill queries
(Diagram: a shared Hive metastore over files and HBase; the Hive SQL layer runs on the MapReduce execution framework, while Drill provides its own SQL layer plus execution engine.)
18. Cross data source queries
• Combine data from files, HBase and Hive in one query
• No central metadata definitions necessary
• Example:
USE HiveTest.CustomersDB
SELECT Customers.customer_name, SocialData.Tweets.Count
FROM Customers
JOIN HBaseCatalog.SocialData SocialData
ON Customers.Customer_id = Convert_From(SocialData.rowkey, UTF-8)
19. BI tool integration
• Standard JDBC/ODBC drivers
• Integration with Tableau, Excel, Microstrategy, Toad, SQuirreL…
20. SQL support
• ANSI SQL compatibility – “SQL-like” is not enough
• SQL data types
– SMALLINT, BIGINT, TINYINT, INT, FLOAT, DOUBLE, DATE, TIMESTAMP, DECIMAL, VARCHAR, VARBINARY…
• All common SQL constructs
– SELECT, GROUP BY, ORDER BY, LIMIT, JOIN, HAVING, UNION, UNION ALL, IN/NOT IN, EXISTS/NOT EXISTS, DISTINCT, BETWEEN, CREATE TABLE/VIEW AS…
• Scalar and correlated subqueries
• Metadata discovery using INFORMATION_SCHEMA
• Support for datasets that do not fit in memory
21. Packaging/install
• Works on all Hadoop distributions
• Easy ramp-up with embedded/standalone mode
– Try out Drill easily on your machine
– No Hadoop requirement
23. High-level architecture
• Drillbits run on each node, designed to maximize data locality
• Drill includes a distributed execution environment built specifically for distributed query processing
• Any Drillbit can act as the endpoint for a particular query
• Zookeeper maintains ephemeral cluster membership information only
• A small distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc.
(Diagram: Zookeeper coordinating several nodes, each running a Drillbit with a distributed cache alongside local storage and processing.)
24. Basic query flow
1. Query comes to any Drillbit (JDBC, ODBC, CLI)
2. Drillbit generates an execution plan based on query optimization & locality
3. Fragments are farmed out to individual nodes
4. Data is returned to the driving node
(Diagram: a query entering one of several Drillbits, each with a distributed cache over DFS/HBase, coordinated by Zookeeper.)
25. Core modules within a Drillbit
(Diagram: SQL parser, optimizer, logical plan, physical plan and execution stages, plus an RPC endpoint, a distributed cache, and a storage engine interface to DFS, HBase and Hive.)
26. Query execution
• Source query – what we want to do (analyst friendly)
• Logical plan – what we want to do (language agnostic, computer friendly)
• Physical plan – how we want to do it (the best way we can tell)
• Execution plan – where we want to do it
27. A query engine that is…
• Optimistic/pipelined
• Columnar/vectorized
• Runtime compiled
• Late binding
• Extensible
28. Optimistic execution
• With a short time horizon, failures are infrequent
– Don’t spend energy and time creating boundaries and checkpoints to minimize recovery time
– Rerun the entire query in the face of failure
• No barriers
• No persistence unless memory overflows
29. Runtime compilation
• Give the JIT help
• Avoid virtual method invocation
• Avoid heap allocation and object overhead
• Minimize memory overhead
31. Data format example
Donut | Price | Icing
Bacon Maple Bar | 2.19 | [Maple Frosting, Bacon]
Portland Cream | 1.79 | [Chocolate]
The Loop | 2.29 | [Vanilla, Fruitloops]
Triple Chocolate Penetration | 2.79 | [Chocolate, Cocoa Puffs]
Record encoding:
Bacon Maple Bar, 2.19, Maple Frosting, Bacon, Portland Cream, 1.79, Chocolate, The Loop, 2.29, Vanilla, Fruitloops, Triple Chocolate Penetration, 2.79, Chocolate, Cocoa Puffs
Columnar encoding:
Bacon Maple Bar, Portland Cream, The Loop, Triple Chocolate Penetration
2.19, 1.79, 2.29, 2.79
Maple Frosting, Bacon, Chocolate, Vanilla, Fruitloops, Chocolate, Cocoa Puffs
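A toy Python sketch (not Drill's actual in-memory format) of the two encodings above; note how a query that only reads prices can scan the price column alone in the columnar layout:

```python
# The donut table from the slide as a list of (name, price, icing) rows.
rows = [
    ("Bacon Maple Bar", 2.19, ["Maple Frosting", "Bacon"]),
    ("Portland Cream", 1.79, ["Chocolate"]),
    ("The Loop", 2.29, ["Vanilla", "Fruitloops"]),
    ("Triple Chocolate Penetration", 2.79, ["Chocolate", "Cocoa Puffs"]),
]

# Record (row-wise) encoding: each record's values stored contiguously.
record_encoding = [v for name, price, icing in rows for v in (name, price, *icing)]

# Columnar encoding: all values of one column stored contiguously.
donuts = [r[0] for r in rows]
prices = [r[1] for r in rows]
icings = [i for r in rows for i in r[2]]

# A scan that only needs prices touches just the price column.
print(round(sum(prices), 2))  # 9.06
```

In the row-wise layout the same sum would have to walk past every name and icing value; that access-pattern difference is the core motivation for columnar execution.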
32. Example: RLE and sum
• Dataset (run-length encoded as value, run-length pairs)
– 2, 4
– 8, 10
• Goal
– Sum all the records
• Normal work
– Decompress & store: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
– Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8
• Optimized work
– 2 * 4 + 8 * 10
– Less memory, fewer operations
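The arithmetic above can be sketched in a few lines of Python (an illustration of the idea; Drill's vectorized operators do this natively on compressed columns):

```python
# Run-length-encoded dataset from the slide: (value, run_length) pairs.
rle = [(2, 4), (8, 10)]

def sum_decompressed(runs):
    # "Normal work": materialize every value, then add them all up.
    values = [v for v, n in runs for _ in range(n)]
    return sum(values)

def sum_rle(runs):
    # "Optimized work": operate directly on the compressed form.
    return sum(v * n for v, n in runs)

# Both give 2*4 + 8*10 = 88, but the RLE version does 2 multiplies
# and 1 add instead of 14 element touches and 13 adds.
assert sum_decompressed(rle) == sum_rle(rle) == 88
```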
33. Record batch
• Drill optimizes for BOTH columnar STORAGE and execution
• A record batch is the unit of work for the query system
– Operators always work on a batch of records
• All values associated with a particular collection of records
• Each record batch must have a single defined schema
• Record batches are pipelined between operators and nodes
(Diagram: record batches, each composed of value vectors (VV), pipelined between operators.)
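The pipelining of record batches between operators can be sketched with plain Python generators (a sketch of the concept only, not Drill's implementation):

```python
# Each operator consumes and produces whole batches of records,
# so per-record overhead is paid once per batch, not once per row.

def scan(data, batch_size=3):
    # Source operator: emit fixed-size record batches.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def filter_op(batches, predicate):
    # Downstream operator: transforms one batch at a time and
    # passes it along without waiting for the full dataset.
    for batch in batches:
        yield [rec for rec in batch if predicate(rec)]

data = list(range(10))
pipeline = filter_op(scan(data), lambda r: r % 2 == 0)
result = [rec for batch in pipeline for rec in batch]
print(result)  # [0, 2, 4, 6, 8]
```

Because generators are lazy, the first filtered batch is available before the scan has read all the data — the same pipelined, no-barrier behavior the deck contrasts with MapReduce.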
34. Strengths of RecordBatch + ValueVectors
• RecordBatch clearly delineates the low-overhead/high-performance space
– Record-by-record, avoid method invocation
– Batch-by-batch, trust the JVM
• Avoid serialization/deserialization
• Off-heap means a large memory footprint without GC woes
• Full specification combined with off-heap and batch-level execution allows C/C++ operators as necessary
• Random access: sort without copy or restructuring
35. Late schema binding
• Schema can change over the course of a query
• Operators are able to reconfigure themselves on schema change events
36. Integration and extensibility points
• Support for UDFs
– UDFs/UDAFs using a high-performance Java API
• Not Hadoop-centric
– Works with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.
– Build one distributed query engine together rather than one per technology
• Built-in classpath scanning and plugin concept to add additional storage engines, functions and operators with zero configuration
• Support for direct execution of strongly specified JSON-based logical and physical plans
– Simplifies testing
– Enables integration of alternative query languages
37. Comparison with MapReduce
• Barriers
– Map completion required before shuffle/reduce commencement
– All maps must complete before reduce can start
– In chained jobs, one job must finish entirely before the next one can start
• Persistence and recoverability
– Data is persisted to disk between each barrier
– Serialization and deserialization are required between execution phases
39. Status
• Heavy active development
• Significant community momentum
– ~15+ contributors
– 400+ people on Drill mailing lists
– 400+ members in the Bay Area Drill user group
• Current state: Alpha
• Timeline: 1.0 Beta (end of Q2 2014), 1.0 GA (Q3 2014)
40. Roadmap
1.0 – Data exploration/ad-hoc queries
• Low-latency SQL
• Schema-less execution
• Files & HBase/M7 support
• Hive integration
• ANSI SQL + extensions for nested data
• BI and SQL tool support via ODBC/JDBC
1.1 – Advanced analytics and operational data
• HBase query speedup
• Rich nested data API
• Analytical functions
• YARN integration
• Security
2.0 – Operational SQL
• Ultra-low-latency queries
• Single-row insert/update/delete
• Workload management
41. Interested in Apache Drill?
• Join the community
– Join the Drill mailing lists
• drill-user@incubator.apache.org
• drill-dev@incubator.apache.org
– Contribute
• Use cases/sample queries, JIRAs, code, unit tests, documentation, …
– Fork us on GitHub: http://github.com/apache/incubator-drill/
– Create a JIRA: https://issues.apache.org/jira/browse/DRILL
• Resources
– Try out Drill in 10 mins
– http://incubator.apache.org/drill/
– https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki