Spark SQL allows users to perform relational operations on Spark's RDDs using a DataFrame API. It addresses challenges in existing systems like limited optimization and data sources by providing a DataFrame API that can query both external data and RDDs. Spark SQL leverages a highly extensible optimizer called Catalyst to optimize logical query plans into efficient physical query plans using features of Scala. It has been part of the Spark core distribution since version 1.0 in 2014.
2. Spark SQL: Relational Data Processing in Spark
Challenges and Solutions
Challenges:
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.
Solutions:
• A DataFrame API that can perform relational operations on both external data sources and Spark’s built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
3. Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
[Figures: # of commits per month; # of contributors]
4. Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
5. Improvement upon Existing Art
• Engine does not understand the structure of the data in RDDs or the semantics of user functions → limited optimization
• Can only be used to query external data in the Hive catalog → limited data sources
• Can only be invoked via SQL strings from Spark → error prone
• Hive optimizer tailored for MapReduce → difficult to extend
9. DataFrame
• A distributed collection of rows with the same schema (RDDs suffer from type erasure)
• Can be constructed from external data sources or from RDDs; essentially an RDD of Row objects (called SchemaRDD before Spark 1.3)
• Supports relational operators (e.g. where, groupBy) as well as Spark operations.
• Evaluated lazily → each DataFrame is an unmaterialized logical plan
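The lazy-evaluation point can be illustrated without Spark at all. The following is a minimal sketch in plain Python, not Spark's implementation: `ToyDataFrame`, `Plan`, and the sample rows are hypothetical names invented for this illustration. It shows an operator call that only extends a plan, with rows touched only once an output action (`collect`) runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plan:
    """One node of an unmaterialized logical plan (toy model)."""
    op: str                          # "scan" or "where"
    child: Optional["Plan"] = None
    arg: object = None               # source rows for scan, predicate for where


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame: it only records a plan."""

    def __init__(self, plan: Plan):
        self.plan = plan

    @staticmethod
    def from_rows(rows: List[dict]) -> "ToyDataFrame":
        return ToyDataFrame(Plan("scan", arg=rows))

    def where(self, predicate: Callable[[dict], bool]) -> "ToyDataFrame":
        # No data is touched here; the call only extends the plan.
        return ToyDataFrame(Plan("where", child=self.plan, arg=predicate))

    def collect(self) -> List[dict]:
        # Execution happens only when an output action is requested.
        return _execute(self.plan)


def _execute(plan: Plan) -> List[dict]:
    if plan.op == "scan":
        return list(plan.arg)
    if plan.op == "where":
        return [row for row in _execute(plan.child) if plan.arg(row)]
    raise ValueError(f"unknown operator: {plan.op}")


seen = []  # records which rows the predicate has inspected


def is_adult(row: dict) -> bool:
    seen.append(row["name"])
    return row["age"] >= 18


users = ToyDataFrame.from_rows([{"name": "a", "age": 17}, {"name": "b", "age": 30}])
adults = users.where(is_adult)
seen_before_action = list(seen)   # still empty: where() ran no predicate
result = adults.collect()         # now the plan is executed
```

Because `where` returns a new plan rather than data, an optimizer gets the chance to look at the whole chain of operators before anything runs, which is the property the slide attributes to DataFrames.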
10. Data Model
• Nested data model
• Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types.
• First-class support for complex data types
11. DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL
• Operators take expression objects
• Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst.
• Alternatively, register the DataFrame as a temporary SQL table and run traditional SQL query strings against it
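How a DSL can make operators "take expression objects" and build an AST is easiest to see in a small model. The sketch below is plain Python with hypothetical classes (`Column`, `Literal`, `BinaryOp`, `show`); it is not Spark's expression hierarchy, only an illustration of operator overloading that returns tree nodes instead of computing values.

```python
from dataclasses import dataclass


class Expr:
    """Base class: comparison and arithmetic return AST nodes, not values."""

    def __gt__(self, other):
        return BinaryOp(">", self, lit(other))

    def __add__(self, other):
        return BinaryOp("+", self, lit(other))


@dataclass(frozen=True)
class Column(Expr):
    name: str


@dataclass(frozen=True)
class Literal(Expr):
    value: object


@dataclass(frozen=True)
class BinaryOp(Expr):
    op: str
    left: Expr
    right: Expr


def lit(value) -> Expr:
    """Wrap plain Python values so every AST child is an Expr."""
    return value if isinstance(value, Expr) else Literal(value)


def show(expr: Expr) -> str:
    """Render the AST as text, fully parenthesized."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Literal):
        return repr(expr.value)
    return f"({show(expr.left)} {expr.op} {show(expr.right)})"


# Writing an ordinary-looking condition yields a tree that an optimizer can inspect.
condition = (Column("age") + 1) > 21
rendered = show(condition)
```

An opaque Python lambda could not be analyzed this way; the tree form is what lets an optimizer reason about the condition.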
12. Advantages over Relational Query Languages
• Holistic optimization across functions composed in different languages.
• Control structures (e.g. if, for)
• Logical plan analyzed eagerly → identifies code errors associated with data schema issues on the fly.
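Eager analysis means a schema mistake surfaces when the plan is built, not when the job finally runs. A minimal sketch of that behaviour, assuming a toy `SchemaFrame` class and `AnalysisError` exception defined here for illustration (neither is taken from Spark's code):

```python
class AnalysisError(Exception):
    """Raised while building the plan, before any data is read."""


class SchemaFrame:
    """Toy frame that carries a schema and a list of pending operations."""

    def __init__(self, schema: dict, ops: tuple = ()):
        self.schema = dict(schema)
        self.ops = tuple(ops)

    def select(self, *names: str) -> "SchemaFrame":
        # The schema check runs immediately, even though execution is deferred.
        missing = [n for n in names if n not in self.schema]
        if missing:
            raise AnalysisError(
                f"cannot resolve column(s) {missing}; available: {sorted(self.schema)}"
            )
        projected = {n: self.schema[n] for n in names}
        return SchemaFrame(projected, self.ops + (("select", names),))


users = SchemaFrame({"name": "string", "age": "integer"})

try:
    users.select("nme")          # typo in the column name
    error = None
except AnalysisError as exc:
    error = str(exc)             # reported at plan-construction time

names_only = users.select("name")
```

The misspelled column is rejected at the `select` call itself, while the valid projection just records another pending operation.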
13. Querying Native Datasets
• Infer column names and types directly from data objects
(via reflection in Java and Scala and data sampling in
Python, which is dynamically typed)
• Native objects accessed in-place to avoid
expensive data format transformation.
• Benefits:
• Run relational operations on existing Spark programs.
• Combine RDDs with external structured data
Columnar storage with hot columns cached in memory
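For dynamically typed objects, the slide says column names and types are inferred by sampling the data. The sketch below shows the general idea in plain Python. The type names, the widening rule (integer and double merge to double, anything else falls back to string), and the `sample_size` parameter are assumptions made for this illustration, not a description of Spark's exact inference rules.

```python
def type_name(value) -> str:
    """Map a Python value to a toy SQL-like type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "string"


def merge_types(previous: str, current: str) -> str:
    """Combine the types seen so far for one field (toy widening rule)."""
    if previous == "null":
        return current
    if current == "null" or previous == current:
        return previous
    if {previous, current} == {"integer", "double"}:
        return "double"
    return "string"                  # assumed fallback for incompatible types


def infer_schema(records: list, sample_size: int = 100) -> dict:
    """Infer {field: type} by inspecting at most sample_size records."""
    schema = {}
    for record in records[:sample_size]:
        for field, value in record.items():
            schema[field] = merge_types(schema.get(field, "null"), type_name(value))
    return schema


records = [
    {"id": 1, "score": 2},
    {"id": 2, "score": 2.5, "name": "x"},
    {"id": 3, "score": None},
]
schema = infer_schema(records)
```

Fields that appear only in some records (here `name`) still end up in the schema, which is one reason sampling more than a single object is useful.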
15. Plan Optimization & Execution
[Diagram: SQL AST / DataFrame → Unresolved Logical Plan → Analysis (uses the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
DataFrames and SQL share the same optimization/execution pipeline
17. Analysis
• An attribute is unresolved if its type is not known or it has
not been matched to an input table.
• To resolve attributes:
• Look up relations by name from the catalog.
• Map named attributes to the input provided by the
given operator’s children.
• Assign a unique ID to references to the same value.
• Propagate and coerce types through
expressions (e.g. 1 + col).
[Figure: Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan]
Example: SELECT col FROM sales
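The resolution steps can be sketched as a toy analyzer in plain Python. Everything here is an assumption made for illustration (the `CATALOG` contents, the `resolve` and `coerce` helpers, the two-type numeric lattice); it is not Catalyst's implementation.

```python
import itertools

# Hypothetical catalog: relation name -> {column name: type}
CATALOG = {"sales": {"col": "double", "region": "string"}}
_ids = itertools.count(1)  # source of unique attribute IDs


def resolve(table, column_names):
    """Look up the relation, then bind each named attribute to an ID and a type."""
    if table not in CATALOG:
        raise LookupError(f"unknown relation: {table}")
    schema = CATALOG[table]
    resolved = []
    for name in column_names:
        if name not in schema:
            raise LookupError(f"cannot resolve column '{name}' in '{table}'")
        resolved.append({"id": next(_ids), "name": name, "type": schema[name]})
    return resolved


def coerce(left_type, right_type):
    """Result type of an arithmetic expression, widening int to double."""
    numeric = ["int", "double"]
    if left_type in numeric and right_type in numeric:
        return numeric[max(numeric.index(left_type), numeric.index(right_type))]
    raise TypeError(f"cannot combine {left_type} and {right_type}")


# SELECT col FROM sales
attrs = resolve("sales", ["col"])
# Type of the expression 1 + col: the int literal is widened to match col.
result_type = coerce("int", attrs[0]["type"])
print(attrs, result_type)
```

A misspelled column fails at this stage, before any data is read.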
20. Logical Optimization
• Applies standard rule-based
optimizations (constant folding,
predicate pushdown, projection
pruning, null propagation, Boolean
expression simplification, etc.)
• 800 LOC
[Figure: Logical Plan → Logical Optimization → Optimized Logical Plan]
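As an illustration of one such rule, here is a small constant-folding sketch in plain Python. The nested-tuple expression encoding is an assumption of this example, not Catalyst's tree representation.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(expr):
    """Bottom-up rewrite: replace (op, literal, literal) with the computed literal."""
    if not isinstance(expr, tuple) or expr[0] == "col":
        return expr  # literals and column references are left alone
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)
    return (op, left, right)


# x + (2 * (1 + 3)): the constant subtree does not depend on any row.
expr = ("+", ("col", "x"), ("*", 2, ("+", 1, 3)))
folded = fold_constants(expr)
print(folded)
```

The constant subtree is evaluated once at planning time instead of once per row.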
22. Plan Optimization & Execution
[Figure: the pipeline diagram again, highlighting Optimized Logical Plan → Physical Planning → Physical Plans]
• e.g. pipeline projections and filters into a single map
23. [Figure: three plans for the same query]
Logical Plan: filter over a join of the events file and the users table
Physical Plan: join of scan (events) with a filter over scan (users)
Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) with optimized scan (users)
24. An Example Catalyst Transformation
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.
Original Plan (top to bottom): Project name → Filter id = 1 → Project id,name → People
Filter Push-Down (top to bottom): Project name → Project id,name → Filter id = 1 → People
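The three steps can be sketched as a tree rewrite in plain Python. The `Scan`, `Project`, and `Filter` classes are hypothetical simplifications, not Catalyst's operators. In this toy model a projection only passes named columns through, so the check in step 2 reduces to "the filter's column is one of the projected columns".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


@dataclass(frozen=True)
class Filter:
    column: str
    value: object
    child: object


def push_filter_below_project(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) wherever it is safe."""
    if isinstance(plan, Scan):
        return plan
    child = push_filter_below_project(plan.child)
    if isinstance(plan, Filter):
        # Steps 1 and 2: a filter directly above a projection that merely
        # passes the filtered column through.
        if isinstance(child, Project) and plan.column in child.columns:
            # Step 3: switch the operators, then keep pushing further down.
            pushed = push_filter_below_project(
                Filter(plan.column, plan.value, child.child)
            )
            return Project(child.columns, pushed)
        return Filter(plan.column, plan.value, child)
    return Project(plan.columns, child)


original = Project(
    ("name",), Filter("id", 1, Project(("id", "name"), Scan("People")))
)
optimized = push_filter_below_project(original)
print(optimized)
```

After the rewrite the filter sits directly above the scan, so fewer rows flow through the projections.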
26. Code Generation
• Relies on Scala’s quasiquotes to simplify code generation.
• Catalyst transforms a SQL expression tree into an abstract syntax tree (AST)
of Scala code that evaluates the expression, then compiles and runs the generated code.
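A rough Python analogue of this idea, offered only as a sketch: the tuple-based expression format and the `gen_source` and `compile_expr` helpers are assumptions of this example, and Python's `eval` stands in for Scala quasiquotes and runtime compilation. The expression tree is turned into source text once, and the compiled function is then applied per row without re-walking the tree.

```python
ALLOWED_OPS = {"+", "-", "*", ">", "<", "=="}


def gen_source(expr):
    """Translate an expression tree into Python source text."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    op, left, right = expr
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    return f"({gen_source(left)} {op} {gen_source(right)})"


def compile_expr(expr):
    """Generate source for the whole expression and compile it into a function."""
    source = f"lambda row: {gen_source(expr)}"
    return eval(source), source


# age + 1
expr = ("+", ("col", "age"), ("lit", 1))
fn, source = compile_expr(expr)
print(source)
print(fn({"age": 41}))
```

The benefit comes from paying the tree-walking cost once at compile time rather than on every row.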
27. Spark SQL: Declarative Big Data Processing
Let Developers Create and Run Spark Programs Faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
28. Write Less Code: Compute an Average
Using RDDs
# split each tab-separated line into [name, age]
data = sc.textFile(...).map(lambda line: line.split("\t"))
(data.map(lambda x: (x[0], [int(x[1]), 1]))                 # (name, [age, 1])
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])   # sum ages and counts per name
    .map(lambda x: [x[0], x[1][0] / x[1][1]])               # [name, sum / count]
    .collect())
Using DataFrames
from pyspark.sql.functions import avg
(sqlCtx.table("people")
    .groupBy("name")
    .agg(avg("age"))   # the grouping column "name" is included automatically
    .collect())
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
Using Pig
P = load '/people' as (name, age);
G = group P by name;
R = foreach G generate … AVG G.age ;
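To see what the RDD version has to spell out by hand, the same sum-and-count logic can be traced in plain Python without Spark. The input lines below are made up for illustration.

```python
from collections import defaultdict

lines = ["alice\t30", "bob\t25", "alice\t40"]  # hypothetical tab-separated input

# map: each line becomes (name, (age, 1))
pairs = [line.split("\t") for line in lines]
mapped = [(name, (int(age), 1)) for name, age in pairs]

# reduceByKey: add up ages and counts per name
totals = defaultdict(lambda: (0, 0))
for name, (age, count) in mapped:
    age_sum, n = totals[name]
    totals[name] = (age_sum + age, n + count)

# final map: sum / count
averages = {name: age_sum / n for name, (age_sum, n) in totals.items()}
print(averages)
```

The DataFrame and SQL versions state only the intent (group by name, average age) and leave this bookkeeping to the engine.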
30. Extensible Input & Output
Spark’s Data Source API allows optimizations like column pruning
and filter pushdown into custom data sources.
[Figure: built-in and external data sources, including JSON, JDBC, and more…]
34. A Dataset is a strongly typed collection of domain-specific objects that can be
transformed in parallel using functional or relational operations. Each Dataset also has
an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions.
Transformations are the ones that produce new Datasets, and actions are the ones
that trigger computation and return results. Example transformations include map,
filter, select, and aggregate (groupBy). Example actions include count, show, and writing data
out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked.
Internally, a Dataset represents a logical plan that describes the computation required
to produce the data. When an action is invoked, Spark's query optimizer optimizes the
logical plan and generates a physical plan for efficient execution in a parallel and
distributed manner. To explore the logical plan as well as optimized physical plan, use
the explain function.
To efficiently support domain-specific objects, an Encoder is required. The encoder
maps the domain specific type T to Spark's internal type system. For example, given
a class Person with two fields, name (string) and age (int), an encoder is used to tell
Spark to generate code at runtime to serialize the Person object into a binary
structure. This binary structure often has a much lower memory footprint and is
optimized for efficiency in data processing (e.g. in a columnar format). To understand
the internal binary representation for data, use the schema function.
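The role of an encoder can be illustrated with a toy Python version built on the standard `struct` module. The byte layout and the `encode` and `decode` helpers are assumptions of this sketch; Spark's actual binary format is different and its encoders are generated at runtime.

```python
import struct
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


def encode(person):
    """Serialize a Person into a compact binary record."""
    raw = person.name.encode("utf-8")
    # Layout assumed for this sketch: 4-byte age, 4-byte name length, name bytes.
    return struct.pack(f"<iI{len(raw)}s", person.age, len(raw), raw)


def decode(buf):
    """Rebuild the Person from its binary record."""
    age, length = struct.unpack_from("<iI", buf, 0)
    (raw,) = struct.unpack_from(f"<{length}s", buf, 8)
    return Person(raw.decode("utf-8"), age)


record = encode(Person("Ann", 34))
print(len(record), decode(record))
```

A fixed, known layout like this lets an engine read a single field (here the age in the first four bytes) without deserializing the whole object.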
DataSet