RHive tutorial - basic functions

This tutorial explains how to load the RHive library and use RHive's basic functions.
Loading RHive

Load RHive the same way you load any other R package:

library(RHive)

Before loading RHive, do not forget to configure the HADOOP_HOME and HIVE_HOME environment variables. HADOOP_HOME is the home directory where Hadoop is installed, and HIVE_HOME is the home directory where Hive is installed. If they are not set, you can set them temporarily before loading the library, as follows. Consult "RHive tutorial - RHive installation and setting" for details on the environment variables.

Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
library(RHive)
rhive.init

rhive.init performs RHive's internal initialization. If the environment variables were set correctly before RHive was loaded, it runs automatically. But if those environment variables were not configured when RHive was loaded via library(RHive), the following errors will result:

rhive.connect()
Error in .jcall("java/lang/Class", "Ljava/lang/Class;", "forName", cl, :
  No running JVM detected. Maybe .jinit() would help.
Error in .jfindClass(as.character(class)) :
  No running JVM detected. Maybe .jinit() would help.
In this case, set HIVE_HOME and HADOOP_HOME as shown below, or exit R, configure the environment variables, and restart R.

Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
rhive.init()

Or close R, export the variables in the shell, and open R again:

export HIVE_HOME="/service/hive-0.7.1"
export HADOOP_HOME="/service/hadoop-0.20.203.0"
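Before calling rhive.init, you can confirm that the variables are visible to the current R session. A quick local check using base R's Sys.getenv (not part of the original examples):

# print both variables; empty strings mean they still need to be set
Sys.getenv(c("HIVE_HOME", "HADOOP_HOME"))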
rhive.connect

All RHive functions work only after a connection to the Hive server has been established. If you call other RHive functions without first establishing a connection with the rhive.connect function, they will fail with errors such as:

Error in .jcast(hiveclient[[1]], new.class = "org/apache/hadoop/hive/service/HiveClient", :
  cannot cast anything but Java objects

Establishing a connection to the Hive server is simple:

rhive.connect()
rhive.connect can also take additional arguments:

rhiveConnection <- rhive.connect("10.1.1.1")

If the Hive server is installed on a machine other than the one running RHive and you have to connect remotely, pass the server address to the rhive.connect function. And if you have multiple Hadoop and Hive clusters configured for RHive and want to switch between the Hives, you can work much as you would with a DB client for MySQL: open a connection per cluster and hand the desired connection over to the functions via arguments to select it explicitly.
rhive.query

If you have used Hive before, you probably know that Hive supports SQL syntax for handling data on MapReduce and HDFS. rhive.query sends SQL to Hive and receives the results. Users who know SQL will find examples like this familiar:

rhive.query("SELECT * FROM usarrests")

Running the example above prints the contents of a table named 'usarrests' on the screen. Instead of just printing the returned result, you can also assign it to a data.frame object:

resultDF <- rhive.query("SELECT * FROM usarrests")

One thing to beware of: if the data returned from rhive.query is bigger than the memory of the machine running RHive, exhausting the available memory will cause an error. Do not pull data of that size into an R object. It is better to first create a temporary table and put the results of the SQL into it, like this:

rhive.query("
  CREATE TABLE new_usarrests (
    rowname string,
    murder double,
    assault int,
    urbanpop int,
    rape double
  )")

rhive.query("INSERT OVERWRITE TABLE new_usarrests SELECT * FROM usarrests")

Consult the Hive documentation for a detailed account of Hive SQL.
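When you do need results back in R, one way to stay within memory is to cap the number of rows pulled into the session. A small sketch using a standard HQL LIMIT clause (not part of the original examples):

# fetch at most 1000 rows into the R session
resultDF <- rhive.query("SELECT * FROM new_usarrests LIMIT 1000")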
rhive.close

If you have finished using Hive and do not wish to use RHive functions any longer, use the rhive.close function to terminate the connection:

rhive.close()

Alternatively, you can pass a specific connection to close it:

conn <- rhive.connect()
rhive.close(conn)
rhive.list.tables

The rhive.list.tables function returns the list of tables in Hive.

rhive.list.tables()
       tab_name
1         aids2
2 new_usarrests
3     usarrests

This is effectively identical to:

rhive.query("SHOW TABLES")
rhive.desc.table

The rhive.desc.table function shows the description of the chosen table.

rhive.desc.table("usarrests")
  col_name data_type comment
1  rowname    string
2   murder    double
3  assault       int
4 urbanpop       int
5     rape    double

This is effectively identical to:

rhive.query("DESC usarrests")
rhive.load.table

The rhive.load.table function loads a Hive table's contents as an R data.frame object.

df1 <- rhive.load.table("usarrests")
df1

This is effectively identical to:

df1 <- rhive.query("SELECT * FROM usarrests")
df1
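The result is an ordinary data.frame, so the usual R inspection functions apply. A quick check (not in the original examples), relying on USArrests having 50 rows plus the added rowname column:

dim(df1)
# [1] 50  5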
rhive.write.table

The rhive.write.table function is the reverse of rhive.load.table, and it is arguably more useful. Normally, if you wish to add data to a table located in Hive, you must first create the table. Using rhive.write.table requires no such additional work: it creates a Hive table from an R data.frame and inserts all of the data in one step. The example below uses the UScrime data from the MASS package.
head(UScrime, 10)
     M So  Ed Po1 Po2  LF  M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time    y
1  151  1  91  58  56 510  950  33 301 108 41 394  261 0.084602 26.2011  791
...
7  127  1 111  82  79 519  982   4 139  97 38 620  168 0.042100 20.6993  963
8  131  1 109 115 109 542  969  50 179  79 35 472  206 0.040099 24.5988 1555
9  157  1  90  65  62 553  955  39 286  81 28 421  239 0.071697 29.4001  856
10 140  0 118  71  68 632 1029   7  15 100 24 526  174 0.044498 19.5994  705
The rhive.write.table function raises an error and does not work if the table to be created in Hive already exists. Hence, if you attempt to save a data.frame under a name that matches a table already in Hive, you must delete the existing table before using rhive.write.table:

if (rhive.exist.table("uscrime")) {
    rhive.query("DROP TABLE uscrime")
}
rhive.write.table(UScrime)
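After writing, the new table can be inspected like any other Hive table. A small follow-up sketch (not in the original examples), assuming the table was created under the lowercased name used above and relying on UScrime having 47 rows:

rhive.desc.table("uscrime")
rhive.query("SELECT COUNT(*) FROM uscrime")
# 47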
RHive - alias functions

RHive's function names look as if they follow S3 generic naming rules, but many of them are actually not generic. This leaves room for the S3 generic functions that RHive may or may not support in the future. For users who dislike the confusion caused by functions that contain "." yet do not count as generic, there are functions with different names that serve the same roles. The alias functions are as follows.

hiveConnect
This is the same as rhive.connect.
hiveQuery
This is the same as rhive.query.
hiveClose
This is the same as rhive.close.
hiveListTables
This is the same as rhive.list.tables.
hiveDescTable
This is the same as rhive.desc.table.
hiveLoadTable
This is the same as rhive.load.table.
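For illustration, a short session written with the aliases; a minimal sketch assuming the usarrests table from the earlier examples:

conn <- hiveConnect()
df <- hiveQuery("SELECT * FROM usarrests")
hiveClose(conn)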
rhive.basic.cut

rhive.basic.cut converts one numerical column of a table into one factorized column: the range of the numerical column is divided into intervals, and the values in the column are factorized according to which interval they fall into. rhive.basic.cut takes six arguments: tablename (a table name), col (a numerical column name), breaks, right, summary, and forcedRef. breaks gives the numerical cut points for the column. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. summary = TRUE returns the total counts of values falling into each interval; if FALSE, the name of a new table containing the factorized column is returned. forcedRef = TRUE forces rhive.basic.cut to return a table name, instead of the data frame returned for forcedRef = FALSE. The defaults of right, summary, and forcedRef are TRUE, FALSE, and TRUE respectively.
Example for summary = FALSE
>
table_name
=
rhive.basic.cut(tablename
=
"iris",
col
=
"sepallength",
breaks
=
seq(0,
5,
0.5),
right
=
FALSE,
summary
=
FALSE,
forcedRef
=
TRUE)
>
table_name
[1]
"rhive_result_1330382904"
attr(,"result:size")
[1]
4296
>
results
=
rhive.query("select
*
from
rhive_result_1330382904")
>
head(results)
rowname
sepalwidth
petallength
petalwidth
species
sepallength
1
1
3.5
1.4
0.2
setosa
NULL
2
2
3.0
1.4
0.2
setosa
[4.5,5.0)
3
3
3.2
1.3
0.2
setosa
[4.5,5.0)
4
4
3.1
1.5
0.2
setosa
[4.5,5.0)
5
5
3.6
1.4
0.2
setosa
NULL
6
6
3.9
1.7
0.4
setosa
NULL
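For intuition, the same bucketing can be reproduced locally with base R's cut, which uses the same breaks/right semantics. Note how sepal lengths of 5.0 and above fall outside the last interval [4.5,5.0), which is why they appear as NULL above. A local sketch, not in the original examples:

# rows 1, 2 and 5 of iris have sepal lengths 5.1, 4.9 and 5.0
cut(c(5.1, 4.9, 5.0), breaks = seq(0, 5, 0.5), right = FALSE)
# [1] <NA>    [4.5,5) <NA>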
Example for summary = TRUE:

> summary = rhive.basic.cut(tablename = "iris", col = "sepallength",
+     breaks = seq(0, 5, 0.5), right = FALSE, summary = TRUE, forcedRef = TRUE)
> summary
     NULL [4.0,4.5) [4.5,5.0)
      128         4        18
rhive.basic.cut2

rhive.basic.cut2 converts two numerical columns of a table into two factorized columns. That is, the range of each numerical column is divided into intervals, and the values in each column are factorized according to which interval they fall into. rhive.basic.cut2 takes eight arguments: tablename (a table name), col1 and col2 (two column names), breaks1, breaks2, right, keepCol, and forcedRef. breaks1 and breaks2 are the numerical cut points for the two columns. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. keepCol = TRUE keeps the two numerical columns even after the conversion; otherwise, the factorized columns replace the original numerical columns. forcedRef = TRUE forces rhive.basic.cut2 to return a table name, instead of the data frame returned for forcedRef = FALSE. The defaults of right, keepCol, and forcedRef are TRUE, FALSE, and TRUE respectively.
Example for right = TRUE and keepCol = FALSE
> table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2
= "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right =
TRUE, keepCol = FALSE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330385833"
attr(,"result:size")
[1] 5272
> results = rhive.query("select * from rhive_result_1330385833")
> head(results)
12. 5
5
3.6
0.2
setosa
5.0
N
ULL
1.4
[1.0,1.5)
1
rhive.basic.xtabs

rhive.basic.xtabs builds a contingency table by cross-classifying factors. A formula object and a table name are the input arguments, and a contingency table in matrix format is returned based on the given formula. For instance, in the formula "ncontrols ~ agegp + alcgp", the two column names agegp and alcgp are the cross-classifying factors, and the observations for each combination of those factors are summed over a third column, ncontrols.
Example for the esoph data:

> xtab_formula = as.formula(paste("ncontrols", "~", "agegp", "+", "alcgp", sep = ""))
> xtab_formula
ncontrols ~ agegp + alcgp
> table_result = rhive.basic.xtabs(formula = xtab_formula, tablename = "esoph")
> head(table_result)
       alcgp
agegp   0-39g/day 120+ 40-79 80-119
  25-34        61    5    45      5
  35-44        89   10    80     20
  45-54        78   15    81     39
  55-64        89   26    84     43
  65-74        71    8    53     29
  75+          27    3    12      2
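Since the esoph data also ships with base R, the same table can be produced locally with the built-in xtabs for comparison (not part of the original examples; the column ordering may differ):

xtabs(ncontrols ~ agegp + alcgp, data = esoph)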
rhive.basic.t.test

The rhive.basic.t.test function runs Welch's t-test on two samples: the difference between the two sample means is tested against the alternative hypothesis that the difference is not 0, so a two-sided test is performed.

The following example tests the mean difference between the irises' sepal lengths and petal lengths. Pay attention to how the function is called with the "sepallength" and "petallength" variables.
> rhive.basic.t.test("iris", "sepallength", "iris", "petallength")
[1] "t = 13.1422338118038, df = 211.542688378717, p-value = 0, mean of x : 5.84333333333333, mean of y : 3.758"
$statistic
       t
13.14223

$parameter
      df
211.5427

$p.value
[1] 0

$estimate
$estimate[[1]]
mean of x
 5.843333

$estimate[[2]]
mean of y
    3.758
Interpreting the results: the p-value is 0, revealing a difference between the means of sepal length and petal length. The resulting statistics are converted into an R list object, and a string assembled from those statistics is printed to the console.

The iris data consists of 150 observations and ships with R. Running R's t.test on the same data yields a slightly different t-statistic of 13.0984. This is because the variance used by the t.test function to compute the t-statistic is the sample variance, while rhive.basic.t.test uses the population variance. As in this example, with little data the t-statistics may deviate, but the deviance dwindles as the data grows. Since rhive.basic.t.test is a function designed with massive data analysis in mind, the population variance is used for speedy calculation.
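For comparison, the in-memory equivalent with base R, which uses the sample variance and therefore reports the slightly different t-statistic mentioned above (a local sketch, not in the original examples):

t.test(iris$Sepal.Length, iris$Petal.Length)
# t = 13.0984, df = 211.54, p-value < 2.2e-16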
rhive.block.sample

The percent argument of rhive.block.sample is optional and sets the percentage of data to extract from the total data. Its default value is 0.01, which means it extracts 0.01% of the total data. However, this value is not the ratio of the actually sampled row count to the total row count, but rather the ratio of sampled blocks to total blocks: rhive.block.sample takes samples by the block.

Thus the entire data set may be returned when using rhive.block.sample on a Hive table with little data; this occurs when the data is smaller than the block size configured in Hive.

The seed argument specifies the random seed used when executing block sampling in Hive. If the random seeds are identical, Hive's block sampling returns the same results. To guarantee random samples on every run, it is best to assign a value to seed using R's sample function.

The subset argument is optional and specifies a condition on the data to be extracted from the targeted Hive table when returning the sample block. It is of character type and corresponds to the 'where' clause in Hive HQL, so it must use syntax appropriate for HQL's where clause.

rhive.block.sample returns, as a character value, the name of the Hive table that contains the block-sampling result. That is, rhive.block.sample automatically creates a temporary Hive table for the sampled blocks and returns that table's name. The following example samples 0.01% of the Hive table called listvirtualmachines, using R's sample function to pick the random seed for Hive's block sampling.

seedNumber <- sample(1:2^16, 1)
rhive.block.sample("listvirtualmachines", seed=seedNumber)
[1] "rhive_sblk_1330404552"

As per this example, a Hive table named "rhive_sblk_1330404552", holding 0.01% of the data from the Hive table listvirtualmachines, has been created.
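A sketch combining the arguments described above; the percent value and the column used in the subset condition are hypothetical, not from the original example:

seedNumber <- sample(1:2^16, 1)
# sample roughly 0.1% of the blocks, restricted by a where-clause condition
# ("state" is a hypothetical column of listvirtualmachines)
sampleTable <- rhive.block.sample("listvirtualmachines", percent = 0.1,
                                  seed = seedNumber, subset = "state = 'Running'")
rhive.query(paste("SELECT COUNT(*) FROM", sampleTable))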
rhive.basic.scale

The rhive.basic.scale function rescales a numerical column to mean 0 and standard deviation 1. Pass the table name as the first argument and the column name as the second. The returned list holds the name of a new table to which a "scaled_<column name>" column has been added; like any other Hive table, it can be accessed and manipulated from RHive.

scaled <- rhive.basic.scale("iris", "sepallength")
attr(scaled, "scaled:center")
# [1] 5.843333
attr(scaled, "scaled:scale")
# [1] 0.8253013
> rhive.desc.table(scaled[[1]])
#             col_name data_type comment
# 1            rowname    string
# 2         sepalwidth    double
# 3        petallength    double
# 4         petalwidth    double
# 5            species    string
# 6        sepallength    double
# 7 scaled_sepallength    double
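The two attributes correspond to the center and scale of base R's scale(); note that the scale here is the population standard deviation rather than sd()'s sample version, consistent with the population-variance choice noted for rhive.basic.t.test. A local check, not in the original examples:

mean(iris$Sepal.Length)
# [1] 5.843333
sd(iris$Sepal.Length) * sqrt(149/150)
# [1] 0.8253013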
rhive.basic.by

The rhive.basic.by function runs a group-by on a specified column. The code below applies group by to the "species" column and returns the result of applying the sum function to "sepallength"; in the results you find the sum of sepallength for each species.

rhive.basic.by("iris", "species", "sum", "sepallength")
#      species   sum
# 1     setosa 250.3
# 2 versicolor 296.8
# 3  virginica 329.4
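The same grouping can be checked locally with base R (not part of the original examples):

aggregate(Sepal.Length ~ Species, data = iris, FUN = sum)
#      Species Sepal.Length
# 1     setosa        250.3
# 2 versicolor        296.8
# 3  virginica        329.4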
rhive.basic.merge

rhive.basic.merge makes a new data set by merging two tables on matching values of the specified key columns.
# checking the data
rhive.query('select * from iris limit 5')
  rowname sepallength sepalwidth petallength petalwidth species
1       1         5.1        3.5         1.4        0.2  setosa
2       2         4.9        3.0         1.4        0.2  setosa
3       3         4.7        3.2         1.3        0.2  setosa
4       4         4.6        3.1         1.5        0.2  setosa
5       5         5.0        3.6         1.4        0.2  setosa

rhive.query('select * from usarrests limit 5')
     rowname murder assault urbanpop rape
1    Alabama   13.2     236       58 21.2
2     Alaska   10.0     263       48 44.5
3    Arizona    8.1     294       80 31.0
4   Arkansas    8.8     190       50 19.5
5 California    9.0     276       91 40.6

## rhive.basic.merge
rhive.basic.merge('iris', 'usarrests', by.x='sepallength', by.y='murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
Merging is similar to 'join' in SQL. The following query produces the same result:

# Use a join to extract and print all rows found to be common after merging.
# Where the row names of the two tables overlap, only the former table's
# rowname is printed.
rhive.big.query('select
  a.sepallength, a.sepalwidth, a.petallength, a.petalwidth, a.species,
  b.assault, b.urbanpop, b.rape, a.rowname
from iris a join usarrests b on a.sepallength = b.murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
rhive.basic.mode

rhive.basic.mode returns the mode and its frequency within a specified column of a Hive table.

rhive.basic.mode('iris', 'sepallength')
  sepallength freq
1           5   10
rhive.basic.range

rhive.basic.range returns the greatest and lowest values within a specified numerical column of a Hive table.

rhive.basic.range('iris', 'sepallength')
[1] 4.3 7.9
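Both results can be sanity-checked locally with base R (not part of the original examples):

range(iris$Sepal.Length)
# [1] 4.3 7.9
sort(table(iris$Sepal.Length), decreasing = TRUE)[1]
#  5
# 10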