The document discusses Hive's new ACID (atomicity, consistency, isolation, durability) functionality which allows for updating and deleting rows in Hive tables. Key points include Hive now supporting SQL commands like INSERT, UPDATE and DELETE; storing changes in delta files and using transaction IDs; and running minor and major compactions to consolidate delta files. Future work may include multi-statement transactions, updating/deleting in streaming ingest, Parquet support, and adding MERGE statements.
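For illustration, a minimal sketch of exercising these ACID statements from Python via PyHive (the host, database and the events table are assumptions; the table must be bucketed, stored as ORC and declared transactional for UPDATE/DELETE to work):

from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cur = conn.cursor()

# ACID tables are currently required to be ORC and flagged transactional.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (id INT, status STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")

cur.execute("INSERT INTO events VALUES (1, 'new'), (2, 'new')")
cur.execute("UPDATE events SET status = 'done' WHERE id = 1")   # change lands in a delta file
cur.execute("DELETE FROM events WHERE id = 2")                  # delete delta, merged away at compaction

cur.execute("SELECT * FROM events")
print(cur.fetchall())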
Building Better Data Pipelines using Apache Airflow - Sid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
In this session, we discussed the end-to-end working of Apache Airflow, focusing mainly on the "why, what and how" factors. It covers DAG creation and implementation, the architecture, and pros & cons. It also shows how a DAG is created to schedule a job, what steps are required to create the DAG using a Python script, and finishes with a working demo.
The document provides an overview of Apache Airflow, an open-source workflow management platform for data pipelines. It describes how Airflow allows users to programmatically author, schedule and monitor workflows or data pipelines via a GUI. It also outlines key Airflow concepts like DAGs (directed acyclic graphs), tasks, operators, sensors, XComs (cross-communication), connections, variables and executors that allow parallel task execution.
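As a concrete illustration of those concepts, here is a minimal DAG sketch (the dag_id, schedule and task logic are assumptions, not taken from any of the decks above):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # placeholder for real transformation logic
    print("transforming extracted data")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # the >> operator declares task dependencies, i.e. the edges of the DAG
    extract >> transform_task >> load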
If you’re already a SQL user then working with Hadoop may be a little easier than you think, thanks to Apache Hive. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
This cheat sheet covers:
-- Query
-- Metadata
-- SQL Compatibility
-- Command Line
-- Hive Shell
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu (Flink Forward)
During the last two major versions (1.9 & 1.10), the Apache Flink community spent a lot of effort improving the architecture toward further unified batch & streaming processing. One example is that Flink SQL added the ability to support multiple SQL planners under the same API. This talk will first discuss the motivation behind these changes, but more importantly will take a deep dive into Flink SQL. The presentation shows the unified architecture for handling streaming and batch queries and explains how Flink translates queries into relational expressions, leverages Apache Calcite to optimize them, and generates efficient runtime code for execution. Besides, this talk will also describe the lifetime of a query in detail, how the optimizer improves the plan based on relational node patterns, how Flink leverages a binary data format for its basic data structures, and how certain operators work. This should give the audience a better understanding of Flink SQL internals.
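A small PyFlink sketch of the unified Table/SQL API described above (the datagen source and the query are illustrative assumptions; switching to in_batch_mode() would run the same SQL as a batch query):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders (
        product STRING,
        amount  INT
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# Flink parses the query, Calcite optimizes the relational plan,
# and code generation produces the runtime operators.
result = t_env.execute_sql(
    "SELECT product, SUM(amount) AS total FROM orders GROUP BY product"
)
result.print()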
Video: https://data-artisans.com/flink-forward-berlin/resources/monitoring-flink-with-prometheus
Live Demo Code: https://github.com/mbode/flink-prometheus-example
Prometheus is a cloud-native monitoring system prioritizing reliability and simplicity – and Flink works really well with it! This session will show you how to leverage the Flink metrics system together with Prometheus to improve the observability of your jobs. There will be a live demo showing how everything ties together. The talk is aimed at people already building and running Flink jobs who would like to gain more insight into them. It is fine if you are not familiar with Prometheus yet, as the basic concepts will be introduced. If you have ever wondered how you could use modern monitoring tools to be alerted in the middle of the night in case your Flink job's 99th percentile end-to-end latency degraded for some reason, this might just be the talk you are looking for.
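Enabling the Prometheus reporter is a small configuration change; a sketch of the relevant flink-conf.yaml entries is shown below (key names vary across Flink versions, so treat this as an assumption to check against your release's documentation):

# flink-conf.yaml (sketch)
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
# port (or port range) on each JobManager/TaskManager that Prometheus will scrape
metrics.reporter.prom.port: 9250-9260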
Window functions enable calculations across partitions of rows in a result set. This document discusses window function syntax, types of window functions available in MySQL 8.0 like RANK(), DENSE_RANK(), ROW_NUMBER(), and provides examples of queries using window functions to analyze and summarize data in partitions.
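For example, a hedged sketch of running such a query from Python with mysql-connector (the connection settings and the employees table are assumptions):

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="hr"
)
cur = conn.cursor()

# RANK() and ROW_NUMBER() are computed per department partition, ordered by salary.
cur.execute("""
    SELECT name,
           department,
           salary,
           RANK()       OVER w AS salary_rank,
           ROW_NUMBER() OVER w AS row_num
    FROM employees
    WINDOW w AS (PARTITION BY department ORDER BY salary DESC)
""")

for name, department, salary, salary_rank, row_num in cur:
    print(name, department, salary, salary_rank, row_num)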
Building large scale transactional data lake using Apache Hudi - Bill Liu
Data is critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency; it helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then take a deep dive into how it improves data operations by providing features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
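To make the incremental model concrete, a minimal PySpark sketch of a Hudi upsert (paths, field names and options are illustrative assumptions, not Uber's production code; the Hudi Spark bundle must be on the classpath, e.g. via --packages):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "rider-A", 27.0, 1700000000), (2, "rider-B", 14.5, 1700000100)],
    ["trip_id", "rider", "fare", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",   # latest ts wins on duplicate keys
    "hoodie.datasource.write.operation": "upsert",
}

# Re-running this with changed rows updates them in place instead of rewriting the table.
df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/trips")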
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
This document provides an overview and deep dive into Robinhood's RDS Data Lake architecture for ingesting data from their RDS databases into an S3 data lake. It discusses their prior daily snapshotting approach, and how they implemented a faster change data capture pipeline using Debezium to capture database changes and ingest them incrementally into a Hudi data lake. It also covers lessons learned around change data capture setup and configuration, initial table bootstrapping, data serialization formats, and scaling the ingestion process. Future work areas discussed include orchestrating thousands of pipelines and improving downstream query performance.
This document provides an overview of building data pipelines using Apache Airflow. It discusses what a data pipeline is, common components of data pipelines like data ingestion and processing, and issues with traditional data flows. It then introduces Apache Airflow, describing its features like being fault tolerant and supporting Python code. The core components of Airflow including the web server, scheduler, executor, and worker processes are explained. Key concepts like DAGs, operators, tasks, and workflows are defined. Finally, it demonstrates Airflow through an example DAG that extracts and cleanses tweets.
Airflow: Save Tons of Money by Using Deferrable Operators - Kaxil Naik
This talk is from Open Source Summit 2022
Apache Airflow 2.2 introduced the concept of Deferrable Tasks that uses Python's async feature.
All the Airflow sensors and poll-based operators can be hugely optimized to save tons of money by freeing up worker slots when polling.
This session will cover the following topics:
- Introduction to the concept of deferrable operators
- Why do we need them?
- When to use them?
- How do they work?
- Writing custom deferrable operators & sensors (a minimal sketch follows below)
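The sketch below illustrates the idea with the Airflow 2.2+ deferral API (the class name, wait time and check logic are assumptions, not the speaker's code):

from datetime import timedelta

from airflow.sensors.base import BaseSensorOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitBeforeCheck(BaseSensorOperator):
    """Defers for wait_seconds, freeing its worker slot, then runs a check when resumed."""

    def __init__(self, wait_seconds: int = 300, **kwargs):
        super().__init__(**kwargs)
        self.wait_seconds = wait_seconds

    def execute(self, context):
        # Instead of sleeping on a worker, hand control to the triggerer process.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(seconds=self.wait_seconds)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Runs on a worker again once the trigger fires.
        self.log.info("Resumed after deferral; performing the actual check now.")
        return True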
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ... - StreamNative
Lakehouses are quickly growing in popularity as a new approach to data platform architecture, bringing some of the long-established benefits of the OLTP world to OLAP, including transactions, record-level updates/deletes, and change streaming. In this talk, we will discuss Apache Hudi and how it unlocks the possibility of building your own fully open-source Lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will:
- Explain what Lakehouses are, and why they are needed.
- Explain what Apache Hudi is and how it works.
- Provide a use-case and demo that applies Apache Hudi's DeltaStreamer tool to ingest data from Apache Pulsar.
Airflow is a platform for authoring, scheduling, and monitoring workflows or data pipelines. It uses a directed acyclic graph (DAG) to define dependencies between tasks and schedule their execution. The UI provides dashboards to monitor task status and view workflow histories. Hands-on exercises demonstrate installing Airflow and creating sample DAGs.
Changelog Stream Processing with Apache Flink - Flink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
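A hedged PyFlink SQL sketch of this changelog-processing pattern, reading Debezium change events from Kafka and maintaining an aggregate as an upsert-kafka table (topic names, servers and the schema are assumptions):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: change events captured from a database by Debezium.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id BIGINT,
        customer STRING,
        amount   DECIMAL(10, 2)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'dbserver.shop.orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")

# Sink: a continuously maintained materialized view, written as an upsert log.
t_env.execute_sql("""
    CREATE TABLE customer_totals (
        customer STRING,
        total    DECIMAL(10, 2),
        PRIMARY KEY (customer) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'customer_totals',
        'properties.bootstrap.servers' = 'kafka:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")

t_env.execute_sql("""
    INSERT INTO customer_totals
    SELECT customer, CAST(SUM(amount) AS DECIMAL(10, 2)) FROM orders GROUP BY customer
""")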
by
Timo Walther
Parallel Execution With Oracle Database 12c - Masterclass - Ivica Arsov
This document provides an overview of parallel execution in Oracle Database 12c. It discusses parallel execution basics like degree of parallelism (DOP), producer-consumer model, and data flow operations (DFOs). It also covers parallel execution administration topics such as ways to enable parallel query, DML, and DDL. Reasons why parallel DML is not enabled by default are also mentioned. The document then dives deeper into parallel execution concepts like distribution methods, auto DOP, in-memory parallel execution, and tracing parallel operations.
Enhancing Spark SQL Optimizer with Reliable Statistics - Jen Aman
This document discusses enhancements to the Spark SQL optimizer through improved statistics collection and cost-based optimization rules. It describes collecting table and column statistics from Hive metastore and developing 1D and 2D histograms. New rules estimate operator costs based on output rows and size. Join order, filter statistics, and handling unique columns are discussed. Future work includes faster histogram collection, expression statistics, and continuous feedback optimization.
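A small illustrative sketch (not the talk's code) of collecting statistics so the cost-based optimizer can use them; the sales and customers tables are assumptions:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cbo-stats-sketch")
    .config("spark.sql.cbo.enabled", "true")
    .config("spark.sql.cbo.joinReorder.enabled", "true")
    .enableHiveSupport()
    .getOrCreate()
)

# Collect table-level and column-level statistics into the metastore.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

# The optimizer can now estimate row counts and sizes for this join and pick a better order.
spark.sql("""
    SELECT c.region, SUM(s.amount)
    FROM sales s JOIN customers c ON s.customer_id = c.id
    GROUP BY c.region
""").explain(mode="cost")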
You've seen the basic 2-stage example Spark Programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
This document provides an introduction and overview of PostgreSQL, including its history, features, installation, usage and SQL capabilities. It describes how to create and manipulate databases, tables, views, and how to insert, query, update and delete data. It also covers transaction management, functions, constraints and other advanced topics.
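A brief psycopg2 sketch of the basic operations the overview covers - creating a table, inserting, updating, querying and committing a transaction (connection details and the accounts table are assumptions):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="demo", user="app", password="secret")

with conn:                          # commits on success, rolls back on exception
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS accounts (
                id SERIAL PRIMARY KEY,
                owner TEXT NOT NULL,
                balance NUMERIC NOT NULL DEFAULT 0
            )
        """)
        cur.execute("INSERT INTO accounts (owner, balance) VALUES (%s, %s)", ("alice", 100))
        cur.execute("UPDATE accounts SET balance = balance - 10 WHERE owner = %s", ("alice",))
        cur.execute("SELECT owner, balance FROM accounts")
        print(cur.fetchall())

conn.close()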
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha... - HostedbyConfluent
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions and optimized table metadata to data lakes, along with powerful storage layout optimizations, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, serverless "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes, and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management such as cleaning older file versions, compaction of delta logs into base files, dynamic re-clustering for faster query performance, and the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... - Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications
Apache Spark and MongoDB - Turning Analytics into Real-Time Action - João Gabriel Lima
This document discusses combining Apache Spark and MongoDB for real-time analytics. It provides an overview of MongoDB's native analytics capabilities including querying, data aggregation, and indexing. It then discusses how Apache Spark can extend these capabilities by providing additional analytics functions like machine learning, SQL queries, and streaming. Combining Spark and MongoDB allows organizations to perform real-time analytics on operational data without needing separate analytics infrastructure.
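A minimal pymongo sketch of the native aggregation capability mentioned above (database, collection and field names are assumptions); Spark would sit on top of this, via the MongoDB Spark connector, for machine learning, SQL and streaming workloads:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Group order documents by status and total their amounts, entirely inside MongoDB.
pipeline = [
    {"$match": {"created_at": {"$gte": "2024-01-01"}}},
    {"$group": {"_id": "$status", "total": {"$sum": "$amount"}, "count": {"$sum": 1}}},
    {"$sort": {"total": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)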
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch... - Simplilearn
The document discusses key concepts related to the Pig analytics framework. It covers topics like why Pig was developed, what Pig is, comparisons of Pig to MapReduce and Hive, Pig architecture involving Pig Latin scripts, a runtime engine, and execution via a Grunt shell or Pig server, how Pig works by loading data and executing Pig Latin scripts, Pig's data model using atoms and tuples, and features of Pig like its ability to process structured, semi-structured, and unstructured data without requiring complex coding.
Apache Airflow is an open-source workflow management platform developed by Airbnb and now an Apache Software Foundation project. It allows users to define and manage data pipelines as directed acyclic graphs (DAGs) of tasks. The tasks can be operators to perform actions, move data between systems, and use sensors to monitor external systems. Airflow provides a rich web UI, CLI and integrations with databases, Hadoop, AWS and others. It is scalable, supports dynamic task generation and templates, alerting, retries, and distributed execution across clusters.
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette - Spark Summit
This document provides an overview of spark-timeseries, an open source time series library for Apache Spark. It discusses the library's design choices around representing multivariate time series data, partitioning time series data for distributed processing, and handling operations like lagging and differencing on irregular time series data. It also presents examples of using the library to test for stationarity, generate lagged features, and perform Holt-Winters forecasting on seasonal passenger data.
Join Postgres expert Bruce Momjian as he discusses common table expressions (CTEs) and how they allow queries to be more imperative, enabling looping and processing of hierarchical structures that are normally associated only with imperative languages.
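As a sketch of that imperative flavour (an assumption, not Bruce Momjian's own example), a recursive CTE that walks an employee/manager hierarchy in PostgreSQL:

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="demo", user="app", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        WITH RECURSIVE org AS (
            SELECT id, name, manager_id, 1 AS depth
            FROM employees
            WHERE manager_id IS NULL            -- start at the root of the hierarchy
            UNION ALL
            SELECT e.id, e.name, e.manager_id, org.depth + 1
            FROM employees e
            JOIN org ON e.manager_id = org.id   -- loop: add each level of reports
        )
        SELECT name, depth FROM org ORDER BY depth, name
    """)
    for name, depth in cur.fetchall():
        print("  " * (depth - 1) + name)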
Garbage in, garbage out - we have all heard about the importance of data quality. Having high quality data is essential for all types of use cases, whether it is reporting, anomaly detection, or avoiding bias in machine learning applications. But where does high quality data come from? How can one assess data quality, improve quality if necessary, and prevent bad quality from slipping in? Obtaining good data quality involves several engineering challenges. In this presentation, we will go through tools and strategies that help us measure, monitor, and improve data quality. We will enumerate factors in data collection and data processing that can cause data quality issues, and we will show how to use engineering to detect and mitigate data quality problems.
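A hedged sketch of the kind of lightweight checks such a pipeline might run, written with plain pandas (the thresholds and column names are illustrative assumptions):

import pandas as pd


def check_quality(df: pd.DataFrame) -> list:
    """Return a list of human-readable data-quality problems found in df."""
    problems = []

    # Completeness: no more than 1% missing user_ids.
    null_rate = df["user_id"].isna().mean()
    if null_rate > 0.01:
        problems.append(f"user_id null rate too high: {null_rate:.2%}")

    # Uniqueness: event ids must not be duplicated.
    dupes = int(df["event_id"].duplicated().sum())
    if dupes:
        problems.append(f"{dupes} duplicate event_id values")

    # Validity: amounts must be non-negative.
    if (df["amount"] < 0).any():
        problems.append("negative amounts present")

    return problems


df = pd.DataFrame(
    {"event_id": [1, 2, 2], "user_id": [10, None, 12], "amount": [5.0, -1.0, 3.0]}
)
for problem in check_quality(df):
    print("DATA QUALITY:", problem)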
Import and Export Excel Data using openxlsx in R Studio - Rupak Roy
This document discusses using the openxlsx package in R to import and export Excel files without relying on Java. It covers functions for loading and reading Excel files, adding and writing data to worksheets, and saving workbooks. Functions covered include loadWorkbook(), readWorkbook(), addWorksheet(), writeData(), and saveWorkbook(). The document provides code examples for using each function to load, manipulate, and save Excel data in R.
Openpyxl is a Python module for working with Excel files without involving the MS Excel application. It is used extensively in operations ranging from data copying to data mining and data analysis, by computer operators, data analysts and data scientists. openpyxl is the most widely used module in Python for handling Excel files. Whether you have to read data from Excel, write data, draw charts, access sheets, rename sheets, add or delete sheets, or format and style sheets, openpyxl will do the job for you.
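A short openpyxl sketch of the operations described above - creating a workbook, renaming and adding sheets, writing rows, drawing a chart and reading the file back (the file name and data are assumptions):

from openpyxl import Workbook, load_workbook
from openpyxl.chart import BarChart, Reference

wb = Workbook()
ws = wb.active
ws.title = "sales"                        # rename the default sheet
ws.append(["month", "units"])             # header row
for row in [("Jan", 10), ("Feb", 14), ("Mar", 9)]:
    ws.append(row)

# Draw a simple bar chart from the written data.
chart = BarChart()
data = Reference(ws, min_col=2, min_row=1, max_row=4)
cats = Reference(ws, min_col=1, min_row=2, max_row=4)
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
ws.add_chart(chart, "D2")

wb.create_sheet("notes")                  # add another sheet
wb.save("report.xlsx")

# Read it back - no MS Excel installation needed.
wb2 = load_workbook("report.xlsx")
print(wb2.sheetnames, wb2["sales"]["B2"].value)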
In java, I want you to implement a Data Structure known as a Doubly.pdf - aromalcom
In java, I want you to implement a Data Structure known as a Doubly-Ended-Queue. A queue is a
"fair" data structure in that it implements FIFO (First In, First Out) behavior. As such, it is
often used to implement various wait lists in computer systems. For example, jobs waiting to use
the CPU, jobs waiting for a printer, jobs waiting to be placed into RAM for execution. In short,
whenever we want a fair strategy for waiting we use queues.
A DEQUE (Doubly-ended-queue) is a related data structure. Although similar to a Queue, it
differs in that it allows for insertions AND deletions from either end of the list (both the front
and the rear).
Your implementation MUST use a doubly-linked-list implementation. You may not use a static
array implementation.
Thus, a Deque is a List but it is one which only concerns itself with the first and last positions for
any insertion or deletion. The 6 operations supported are :
public void insertFront( int item ) - insert the given item (as a node) into the first position of the
Deque.
public void insertRear( int item ) - insert the given item (as a node) into the last position of the
Deque.
public int deleteFront( ) - delete and return the element stored in the first node of the Deque.
public int deleteRear( ) - delete and return the element stored in the last node of the Deque.
public boolean isempty( ) - returns true if the Deque is currently empty or false if it is not.
public void printDeque( ) - print the integers from the list, one per line, from the first element
through to the last in order.
Classes
Your program must implement the following 3 classes.
public class dequeDriver
This class will contain your program’s main method. It will need to declare a deque object and
process input as indicated below.
Your program should prompt the user for the path of an input file. It should open the file for
input and process it line by line. Each line of the input file will have one of the following forms.
PR
IF
IR
DF
DR
The meanings of each input is as follows:
PR - print the current contents of the deque from front to rear using the printDeque( ) method of
the deque object.
IF - insert the given int value into the front of the deque.
IR - insert the given int value into the rear of the deque.
DF - delete the front value from the deque.
DR – delete the rear element of the deque.
Below is an example input file that your program should be able to process.
PR
IF 4
IF 5
IF 6
IR 7
PR
DR
PR
DF
PR
The output for the input file shown above is :
EMPTY DEQUE
----- Front -----
6
5
4
7
----- Rear -----
----- Front -----
6
5
4
----- Rear -----
----- Front -----
5
4
----- Rear -----
public class dequeNode
This class will implement the linked nodes that will be used to implement the deque itself.
It should have the following protected data members.
protected dequeNode next; // next pointer to next node
protected dequeNode prev; // previous pointer to previous node
protected int val; // the integer value stored within the dequeNode.
The document discusses how to read Excel files in Java using the Apache POI API. It explains that Java does not have built-in support for Excel files, but the Apache POI library can be used to read and write Excel (.xls and .xlsx) files. It provides code samples to read data from Excel files, including reading specific cell values. It also demonstrates how to handle Excel files, extract workbook, sheet, row and cell objects, and get cell values and types using the Apache POI classes.
This document contains information about Lex, Yacc, Flex, and Bison. It provides definitions and descriptions of each tool. Lex is a lexical analyzer generator that reads input specifying a lexical analyzer and outputs C code implementing a lexer. Yacc is a parser generator that takes a grammar description and snippets of C code as input and outputs a shift-reduce parser in C. Flex is a tool similar to Lex for generating scanners based on regular expressions. Bison is compatible with Yacc and can be used to develop language parsers.
This document provides instructions for using WinCOM (Component Object Model) to export data from an Arduino motor controller application to an Excel spreadsheet. It discusses:
1) Using WinCOM to export data from the application to cells in an Excel spreadsheet by creating an Excel object and filling cells.
2) Controlling a DC motor with an Arduino board and sending motor data back to be exported to an Excel file.
3) Configuring COM port settings to communicate between the Arduino and computer for sending and receiving data.
This document discusses linker scripts and their purpose. It explains that linker scripts control how sections from input files are mapped into the output file and define the memory layout. It describes several keywords used in linker scripts like ENTRY, OUTPUT_FORMAT, MEMORY, and SECTIONS. The SECTIONS keyword is used to describe the memory layout by assigning sections like .text, .data, and .bss to addresses.
This document discusses various topics related to enterprise resource planning (ERP) systems and technologies. It defines ERP as business process management software that integrates applications to manage business functions. It describes the typical lifecycle of an ERP implementation project, including pre-evaluation, evaluation, project planning, gap analysis, reengineering, training, testing, and post-implementation. It also discusses ERP-related technologies like business intelligence, supply chain management, and customer relationship management.
The document provides an overview of LaTeX and discusses:
- LaTeX is a typesetting system that incorporates a macro processor to typeset documents.
- LaTeX uses markup tags and commands to specify formatting rather than using a graphical user interface.
- The document discusses LaTeX document classes, packages, file types, basic commands, environments, cross referencing, fonts, graphics, and tables. It also provides an overview of LaTeX editors like TeXstudio and distributions like MiKTeX.
MATLAB stands for Matrix Laboratory. MATLAB was originally written to provide easy access to matrix software developed by the LINPACK (linear system package) and EISPACK (eigen system package) projects.
Linked List Static and Dynamic Memory Allocation - Prof Ansari
Static variables are declared and named while writing the program. (Space for them exists as long as the program, in which they are declared, is running.) Static variables cannot be created or destroyed during execution of the program in which they are declared.
Dynamic variables are created (and may be destroyed) during program execution since dynamic variables do not exist while the program is compiled, but only when it is run, they cannot be assigned names while it is being written. The only way to access dynamic variables is by using pointers. Once it is created, however, a dynamic variable does contain data and must have a type like any other variable. If a dynamic variable is created in a function, then it can continue to exist even after the function terminates.
Linked Linear List
We saw in previous chapters how static representation of a linear ordered list through an array leads to wastage of memory and, in some cases, overflows. Now we don't want to assign memory to any linear list in advance; instead, we want to allocate memory to elements as they are inserted into the list. This requires dynamic allocation of memory, which can be achieved by using the malloc() or calloc() functions.
But the memory assigned to the elements will not be contiguous, which is a requirement for a linear ordered list and was provided by the array representation. How could we achieve this?
1. Include the algorithm2e package in the preamble
2. Define keywords like Initialize, Function, Input using \SetKwProg and \SetKwInOut
3. Begin the algorithm environment and add a caption
4. Use the defined keywords and other instructions to describe the algorithm steps
5. End the algorithm environment
This allows algorithms to be clearly presented with customized keywords in a formatted manner. The algorithm2e package is very useful for presenting pseudocode in LaTeX documents.
The document provides an overview of yacc (Yet Another Compiler Compiler), which is a tool that parses a stream of tokens according to a user-specified grammar. It describes the structure of a yacc file, which includes definitions, rules, and code sections. It also discusses how yacc interacts with lex to generate tokens, and how values can be returned from lex to yacc using the yylval variable. An example calculator program is provided to demonstrate how yacc can be used to parse arithmetic expressions by defining grammar rules and associating actions with parsing steps.
JXL is the library of the JExcel API, an open source Java API that dynamically reads, writes, and modifies Excel spreadsheets.
We can use its powerful features to build an automated testing framework using Selenium WebDriver. JXL works as a data provider where multiple sets of data are required as input. Moreover, users can read and write information using external Excel files. JXL also helps create custom reports where users have full authority to design reports as per their needs.
Listen to this webinar to explore JXL with examples.
SQL Loader is a utility used to load data from flat files into Oracle tables. It can load data from other databases by first converting the data to a flat file format. The document provides steps for using SQL Loader including writing a control file to describe the data file and load options, creating Oracle tables, and running SQL Loader to import the data. SQL Loader can load data into multiple tables at once using WHEN conditions and supports both conventional and direct path loading methods.
The SQL Loader utility can be used to load external data from files into Oracle tables. It uses a control file to describe the loading process. The control file specifies the data file, table, column definitions, field delimiters and other loading options. SQL Loader then loads the data according to the specifications in the control file. Logs and error files can be generated to monitor and debug the load process. Data can be loaded into single or multiple tables based on conditions specified in the control file.
This document summarizes two methods for reading Excel files in R. The first method uses the gdata package to read Excel files with the read.xls function after setting the working directory. The second method uses the xlsx package to read a specific worksheet from an Excel file stored in the package folder with read.xlsx after loading the package. Both methods allow importing Excel data into R for analysis.
CSC8503 Principles of Programming Languages Semester 1, 2015.docx - faithxdunce63732
CSC8503 Principles of Programming Languages Semester 1, 2015
Assignment 2
Due Date: 11:55pm AEST (13:55 UTC/GMT) Monday 10 May 2015
Weighting: 20%
Total marks: 20
Please submit this assignment using the assignment submission facility on the course
Study Desk. Submit a single file, either a ZIP or TAR archive. The archive
should contain (1) for Part A, a Haskell source file containing the function definitions,
and (2) for Part B, your version of all the files that are in the SPL distribution that you
downloaded.
Just add the Haskell file (call it say ass2.hs) to your collection of SPL files and zip or
tar them into an archive that you submit.
Part A – Haskell – 12 marks
Complete the following Haskell function definitions. Unless stated otherwise do not use library
functions that are not in the Haskell standard prelude. This constraint is so that you
gain practice in simple Haskell recursive programming. The Haskell 2010 standard prelude
definition is available at
https://www.haskell.org/onlinereport/haskell2010/haskellch9.html
Place all definitions in a single file. Submit just this text file electronically as
directed on the course Study Desk page. Use the specified function name as your
code will be tested by a Haskell function expecting that function name.
The testing program may use many more test cases than the ones shown in the specification.
So, please test your functions extensively to ensure that you maximise your marks.
1. [2 marks]
Write the function insertAt :: Int -> a -> [a] -> [a].
insertAt n x xs will insert the element x into the list xs at position n items from the
beginning of xs. In other words, skip n items in xs, then insert the new element.
You can assume that n will be a non-negative number. If n is greater than the length of
the list xs then add it to the end of the list.
For example
insertAt 3 '-' "abcde" ⇒ "abc-de"
insertAt 2 100 [1..5] ⇒ [1,2,100,3,4,5]
Hint: Use standard prelude functions ++ and splitAt.
2. [2 marks] Write a function uniq :: Eq a => [a] -> [a] that removes duplicate entries
from a sorted (in ascending order) list. The resulting list should be sorted, and no value
in it can appear elsewhere in the list.
For example:
uniq [1,2,2] ⇒ [1,2]
uniq [1,2,3] ⇒ [1,2,3]
3. [1 mark] Write a function
join :: Eq a => [(a,b)] -> [(a,c)] -> [(a,b,c)].
join takes two lists of pairs, and returns a single list of triples. A triple is generated only
when there exists a member of both argument lists that have the same first element. The
list elements are not sorted. This is the same semantics as the relational algebra natural
join operation.
For example:
join [(2,"S"),(1,"J")] [(2,True),(3,False)]
⇒ [(2,"S",True)]
join [(2,"S"),(1,"J")] [(2,1),(2,2),(3,4)]
⇒ [(2,"S",1),(2,"S",2)]
Hint: use a list comprehension.
4. [1 mark] This question extends the join function from question 3. Write the function
ljoin :: Eq a => [(a,b)] -> [(a,c.
Hierarchical Clustering - Text Mining/NLP - Rupak Roy
Documented Hierarchical clustering using Hclust for text mining, natural language processing.
Thanks, for your time, if you enjoyed this short article there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Clustering K means and Hierarchical - NLP - Rupak Roy
Classify and cluster natural language text via K-means, Hierarchical clustering and more.
Network Analysis using 3D interactive plots along with their steps for implementation.
Explore detailed Topic Modeling via LDA (Latent Dirichlet Allocation) and its steps.
Widely accepted steps for sentiment analysis.
Process the sentiments of NLP with Naive Bayes Rule, Random Forest, Support Vector Machine, and much more.
Detailed pattern search using regular expressions with grepl, grep, gregexpr, and replacement with sub, gsub and much more.
Detailed documented with the definition of text mining along with challenges, implementing modeling techniques, word cloud and much more.
Bundled with the documentation to the introduction of Apache Hbase to the configuration.
Let me know if anything is required. Happy to help.
Ping me google #bobrupakroy.
Understand and implement the terminology of why partitioning the table is important and the Hive Query Language (HQL)
Installing Apache Hive, internal and external table, import-export - Rupak Roy
Perform Hive installation with internal and external table import-export and much more
Well illustrated with definitions of Apache Hive with its architecture workflows plus with the types of data available for Apache Hive
Automate the complete big data process from import to export data from HDFS to RDBMS like sql with apache sqoop
Apache Sqoop - Import with Append mode and Last Modified mode - Rupak Roy
Get familiar with Sqoop's advanced functions like import with append mode and last-modified mode.
Get acquainted with the differences in Sqoop and its added advantages, with hands-on implementation.
Get acquainted with a distributed, reliable tool/service for collecting a large amount of streaming data to centralized storage with their architecture.
Enhance analysis with detailed examples of Relational Operators - II, including Foreach, Filter, Join, Co-Group, Union and much more.
Passing Parameters using File and Command Line - Rupak Roy
Explore other well-known functions, the flatten operator, and other available options for passing parameters.
Get to know the implementation of apache Pig relational operators like order, limit, distinct, groupby.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
Biological screening of herbal drugs: introduction and need for phyto-pharmacological screening, new strategies for evaluating natural products, in vitro evaluation techniques for antioxidant, antimicrobial and anticancer drugs, in vivo evaluation techniques for anti-inflammatory, antiulcer, anticancer, wound healing, antidiabetic, hepatoprotective, cardioprotective, diuretic and antifertility activity, and toxicity studies as per OECD guidelines.
It describes the bony anatomy, including the femoral head, acetabulum, and labrum. It also discusses the capsule and ligaments. The muscles that act on the hip joint and its range of motion are outlined. Factors affecting hip joint stability and weight transmission through the joint are summarized.
Macroeconomics- Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
This slide is intended for master's students (MIBS & MIFB) at UUM. It is also useful for readers interested in the topic of contemporary Islamic banking.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
2. Working with excel files
R also comes with different packages to read, write and manipulate Excel files directly, without converting them to other formats.
Some of the common packages used today are:
- XLConnect - uses rJava, a low-level R-to-Java interface
- openxlsx - uses C++ dependencies instead of rJava (Java)
- gdata - with Perl dependencies
- readxl, xlsx, and readr packages
Let’s learn each of them in detail.
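As a quick, hedged sketch (the file sample.xlsx and the sheet name "store" are the same ones used in the XLConnect slides below; read_excel() and read.xlsx() are the standard readxl and openxlsx functions, not covered in these slides), this is how two of the alternative packages read the same workbook:
#readxl: reads .xls/.xlsx without a Java dependency
>install.packages("readxl")
>library(readxl)
>store_readxl <- read_excel("sample.xlsx", sheet = "store")
#openxlsx: C++ based, can also write .xlsx files
>install.packages("openxlsx")
>library(openxlsx)
>store_openxlsx <- read.xlsx("sample.xlsx", sheet = "store")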
3. XLCONNECT
XLConnect is a connector for R that provides comprehensive functionality to read, write and format Excel data.
Import functions include:
loadWorkbook()
readWorksheet()
readWorksheetFromFile()
Export functions include:
createSheet()
writeWorksheet()
saveWorkbook()
4. XLCONNECT:loadWorkbook()
loadWorkbook(): Loads or creates a Microsoft Excel workbook in R for further manipulation.
>loadWorkbook(filename, create = FALSE, password = NULL)
Where
filename = the Excel workbook to be loaded
create = specifies whether the file should be created if it does not already exist (default is FALSE)
password = password to use when opening password-protected files. The default NULL means no password is used. This argument is ignored when creating new files using create = TRUE.
5. XLCONNECT:loadWorkbook()
#install the XLConnect package
>install.packages("XLConnect", dependencies = TRUE)
#load the functions from XLConnect package.
>library(XLConnect)
#load the excel file
>xlsx_data <- loadWorkbook("sample.xlsx")
>class(xlsx_data)
To know more about the features of loadWorkbook() use
>?XLConnect::loadWorkbook
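Before reading, it can help to confirm what worksheets the loaded workbook actually contains. A minimal sketch, assuming the xlsx_data object created above; getSheets() is an XLConnect function, though it is not covered in these slides:
#list the worksheet names of the loaded workbook
>getSheets(xlsx_data)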
6. XLCONNECT:readWorksheet()
readWorksheet(): Reads data from worksheets of a workbook loaded via loadWorkbook().
>worksheet1 <- readWorksheet(object, sheet, startCol, endRow, header = TRUE, ...)
Where
object = name of the workbook from loadWorkbook()
sheet = sheet name of the workbook
startCol = the index of the first column to read from; defaults to 0, meaning the start column is determined automatically
endRow = the index of the last row to read from; defaults to 0, meaning the end row is determined automatically
startRow = the index of the first row to read from; defaults to 0, meaning the start row is determined automatically
endCol = the index of the last column to read from; defaults to 0, meaning the end column is determined automatically
7. XLCONNECT:readWorksheet()
#install the XLConnect package
>install.packages("XLConnect", dependencies = TRUE)
#load the functions from XLConnect package.
>library(XLConnect)
#Read the 1st excel sheet from the xlsx_data R object, i.e. the sample.xlsx file
>excel_data <- readWorksheet(xlsx_data, "store", header = TRUE)
>View(excel_data)
#Read the 2nd excel sheet from the xlsx_data R object, i.e. the sample.xlsx file
>excel_data2 <- readWorksheet(xlsx_data, "bike_sharing_program", endRow = 10, startCol = 3, header = TRUE)
>View(excel_data2)
To know more about the features of readWorksheet() use
>?XLConnect::readWorksheet
8. XLCONNECT:readWorksheetFromFile()
readWorksheetFromFile(): Reads data from a worksheet directly from a physical Excel file.
>worksheet3 <- readWorksheetFromFile(file, sheet, startCol, endRow, header = TRUE, ...) (same arguments as readWorksheet)
Where
file = name of the Excel file to be read
sheet = sheet name of the workbook
startCol = the index of the first column to read from; defaults to 0, meaning the start column is determined automatically
endRow = the index of the last row to read from; defaults to 0, meaning the end row is determined automatically
startRow = the index of the first row to read from; defaults to 0, meaning the start row is determined automatically
endCol = the index of the last column to read from; defaults to 0, meaning the end column is determined automatically
9. XLCONNECT:readWorksheetFromFile()
#install the XLConnect package
>install.packages("XLConnect", dependencies = TRUE)
#load the functions from XLConnect package.
>library(XLConnect)
#Read the excel sheet directly from an excel file
>excel_data3 <- readWorksheetFromFile("sample.xlsx", "store", header = TRUE)
>View(excel_data3)
XLConnect::readWorksheetFromFile() - the only difference between readWorksheet() and readWorksheetFromFile() is that with readWorksheet() the Excel file first has to be loaded into R using loadWorkbook() before the data can be read, whereas readWorksheetFromFile() reads the sheet directly from the physical file.
To know more about the features of readWorksheetFromFile() use
>?XLConnect::readWorksheetFromFile
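As a small illustration that is not part of the original slides, readWorksheet() can be combined with XLConnect's getSheets() to read every worksheet of a loaded workbook into a named list of data frames:
#read all worksheets of xlsx_data into a named list
>sheet_names <- getSheets(xlsx_data)
>all_sheets <- lapply(sheet_names, function(s) readWorksheet(xlsx_data, sheet = s))
>names(all_sheets) <- sheet_names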
10. XLCONNECT:createSheet()
createSheet(): Creates a new worksheet in a workbook loaded via loadWorkbook().
>createSheet(object, name)
Where
object = name of the workbook to use
name = name of the sheet to create
11. XLCONNECT:createSheet()
#install the XLConnect package
>install.packages("XLConnect", dependencies = TRUE)
#load the functions from XLConnect package.
>library(XLConnect)
#Create a new empty excel sheet in the workbook
>createSheet(xlsx_data, "new_sheet")
XLConnect::createSheet() - Creates a worksheet with the specified name if it
does not already exist. The naming of worksheets needs to be in line with
Excel's convention, otherwise an exception will be thrown. For example,
worksheet names cannot be longer than 31 characters.
To know more about the features of createSheet() use
>?XLConnect::createSheet
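A related helper worth knowing (existsSheet() is part of XLConnect but not covered in these slides) checks whether a worksheet with a given name already exists in the workbook, for example before writing to it:
#check whether the sheet is already present in the workbook
>existsSheet(xlsx_data, "new_sheet")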
12. XLCONNECT:writeWorksheet()
writeWorksheet(): Writes data to a worksheet of a workbook loaded via loadWorkbook().
>writeWorksheet(object, data, sheet = "sheet_name")
Where
object = the workbook to write to
data = data to be written
sheet = the name or index of the sheet to write to
startRow = index of the first row to write to; the default is startRow = 1
startCol = index of the first column to write to; the default is startCol = 1
header = specifies if the column names should be written; default is TRUE
13. XLCONNECT:writeWorksheet()
#install the XLConnect package
>install.packages("XLConnect", dependencies = TRUE)
#load the functions from XLConnect package.
>library(XLConnect)
#Write/copy the data read earlier from the bike_sharing_program sheet into the new workbook sheet
>writeWorksheet(xlsx_data, excel_data2, sheet = "new_sheet")
XLConnect::writeWorksheet() - Writes data to the worksheet specified by sheet. Data here is assumed to be a data.frame and is coerced to one if this is not already the case. startRow and startCol define the top-left corner of the data region to be written.
To know more about the features of writeWorksheet() use
>?XLConnect::writeWorksheet
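Since startRow and startCol define the top-left corner of the written region, an offset write looks like this (a hedged sketch reusing xlsx_data and excel_data2 from the earlier slides; the offsets are arbitrary example values):
#write the data starting at row 5, column 2 of new_sheet
>writeWorksheet(xlsx_data, excel_data2, sheet = "new_sheet", startRow = 5, startCol = 2)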
14. XLCONNECT:saveWorkbook()
saveWorkbook(): Saves a workbook to the corresponding Excel file. This
method actually writes the workbook object to disk.
>saveWorkbook(object, file)
Where
object = the workbook to save
file = the file to which the workbook will be saved ("save as")
>saveWorkbook(xlsx_data, "document1.xlsx")
To know more about the saveWorkbook() function use
>?XLConnect::saveWorkbook
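Putting the pieces together, a hedged end-to-end sketch using the objects and file names from the earlier slides (document1.xlsx is just an example output name):
#load the workbook, read a sheet, copy it into a new sheet and save to disk
>library(XLConnect)
>xlsx_data <- loadWorkbook("sample.xlsx")
>excel_data <- readWorksheet(xlsx_data, "store", header = TRUE)
>createSheet(xlsx_data, "new_sheet")
>writeWorksheet(xlsx_data, excel_data, sheet = "new_sheet")
>saveWorkbook(xlsx_data, "document1.xlsx")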