RHive aims to integrate R and Hive by allowing analysts to use R's familiar environment while leveraging Hive's capabilities for big data analysis. RHive allows R functions and objects to be used in Hive queries through RUDFs and RUDAFs. It also provides functions like napply to analyze big data in HDFS using R in a distributed manner. RHive provides a bridge between the two environments without requiring users to learn MapReduce programming.
Speaking of big data analysis, what comes to mind is probably using HDFS and MapReduce within Hadoop. But to write a MapReduce program, one must face the problem of learning to write native Java. One might wonder: is it possible to use R, the most popular language adopted by data scientists, to implement MapReduce programs? And through the integration of R and Hadoop, can one truly unleash the power of parallel computing and big data analysis?
This slide deck introduces how to install RHadoop step by step and how to write a MapReduce program in R. Moreover, it will discuss whether RHadoop is really a guiding light for big data analysis, or just another way to write MapReduce programs.
Please mail me if you find any problem with the slides. EMAIL: tr.ywchiu@gmail.com
Integrating R & Hadoop - Text Mining & Sentiment Analysis (Aravind Babu)
This project examines the sentiment expressed on social media (Twitter) toward smartphones, showing how to perform text mining on Hadoop data and analyze it by integrating R with Hadoop.
Processing massive amounts of data with MapReduce using Apache Hadoop (IndicThreads)
Session presented at the 2nd IndicThreads.com Conference on Cloud Computing held in Pune, India on 3-4 June 2011.
http://CloudComputing.IndicThreads.com
Abstract: Processing massive amounts of data yields great insights for business analysis. Many fundamental algorithms run over the data and produce information that can be used for business benefit and scientific research. Extracting and processing large amounts of data has become a primary concern in terms of time, processing power, and cost. The MapReduce algorithm promises to address these concerns. It makes computation over large data sets considerably easier and more flexible, and it offers high scalability across many computing nodes. This session will introduce the MapReduce algorithm, followed by a few variations of it and a hands-on example in MapReduce using Apache Hadoop.
Speaker: Allahbaksh Asadullah is a Product Technology Lead at Infosys Labs, Bangalore. He has over 5 years of experience in the software industry across various technologies. He has worked extensively on GWT, Eclipse plugin development, Lucene, Solr, NoSQL databases, etc. He speaks at developer events such as ACM Compute, IndicThreads, and DevCamps.
This presentation is about Scalding, with a focus on its programming model compared to Hadoop and Cascading. I gave this presentation to the group http://www.meetup.com/riviera-scala-clojure
MapReduce - Basics | Big Data Hadoop Spark Tutorial (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab
This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to program your first MapReduce task and how to run it on a pseudo-distributed Hadoop installation.
This session was given in Arabic, and I may provide a video for the session soon.
Hadoop has become the most common system for storing big data.
Around Hadoop, many supporting systems emerged to fill the gaps in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
Since one presentation cannot cover too many, I tried to focus on the most famous/popular ones and on the most interesting ones.
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2kyXPo0
This CloudxLab Writing MapReduce Programs tutorial helps you to understand how to write MapReduce Programs using Java in detail. Below are the topics covered in this tutorial:
1) Why MapReduce?
2) Write a MapReduce Job to Count Unique Words in a Text File
3) Create Mapper and Reducer in Java
4) Create Driver
5) MapReduce Input Splits, Secondary Sorting, and Partitioner
6) Combiner Functions in MapReduce
7) Job Chaining and Pipes in MapReduce
Hadoop, being a disruptive data processing framework, has made a large impact on the data ecosystems of today. Enabling business users to translate existing skills to Hadoop is necessary to encourage adoption and to let businesses get value out of their Hadoop investment quickly. R, a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. With the advent of technologies such as RHadoop, running R workloads on Hadoop has become much easier. This session will help you understand how RHadoop projects such as rmr and rhdfs work with Hadoop, and will show examples of using these technologies on the Hortonworks Data Platform.
MapReduce examples, starting from the basic WordCount up to a more complex k-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
Slides of the workshop conducted at Model Engineering College, Ernakulam, and Sree Narayana Gurukulam College, Kadayiruppu, Kerala, India, in December 2010.
This presentation demonstrates some of the resources available in a Hadoop cluster, as well as the main components of the ecosystem used at Magazine Luiza. In addition, it compares with big names in the market that also use this technology.
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics (Cloudera, Inc.)
When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the open-source programming language used by more than 2 million users that was developed specifically for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at a use case that provides real-world experience. Finally, it will suggest how enterprises can take advantage of both of these industry-leading technologies.
Presented at Hadoop World 2011 by:
David Champagne
CTO, Revolution Analytics
David Champagne is a top software architect, programmer, and product manager with over 20 years of experience in enterprise and web application development for business customers across a wide range of industries. As Principal Architect/Engineer at SPSS, Champagne led the development teams and created and led the text-mining team.
How to use Hadoop and R for big data parallel processing (Bryan Downing)
From Ram V at Dawn Analytics
http://quantlabs.net/membership.htm
Present to Meetup.com/quant-finance
Key lecture for the EURO-BASIN Training Workshop on Introduction to Statistical Modelling for Habitat Model Development, 26-28 Oct, AZTI-Tecnalia, Pasaia, Spain (www.euro-basin.eu)
My talk at August's joint meeting of Chicago's R and Hadoop user groups, providing an introduction to using R with Hadoop. It starts with a quick overview of the available options, then focuses on using RHadoop's rmr package to analyze the publicly available 'airline' data set.
An introduction to Hadoop. This seminar was intended for non-IT engineers, mostly NLP specialists and cognitive scientists.
See the blog post for more information on this presentation.
This lecture is about using the R language for statistical computing, aimed at beginners who want to start working on financial analytics with R. The lecture is also useful for finance professionals who want to use R for financial data analysis.
Many experts believe that ageing can be delayed; this is one of the main goals of the Institute of Healthy Ageing at University College London. I will present the results of my lifespan-extension research, where we integrated publicly available gene databases in order to identify ageing-related genes. I will show what challenges we met and what we have learned about the process of ageing.
Ageing is one of the fundamental mysteries in biology and many scientists are starting to study this fascinating process. I am part of the research group led by Dr Eugene Schuster at UCL Institute of Healthy Ageing. We experiment with Drosophila and Caenorhabditis elegans by modifying their genes in order to create long-lived mutants. The results of our experiments are quantified using high-throughput microarray analysis. Finally we apply information technology in order to understand how the ageing process works. I will show how we mine microarrays data in order to find the connections between thousands of genes and how we identify candidates for ageing genes.
We are interested in building a better understanding of genes functions by harnessing the large quantity of experimental microarray data in the public databases. Our hope is that after understanding the ageing process in simpler organisms we will be able to apply this knowledge in humans.
Cross-referencing expressions levels in thousands of genes and hundreds of experiments turned out to be a computationally challenging problem but Hadoop and Amazon cloud came to our rescue. In this talk I will present a case study based on our use of R with Amazon Elastic MapReduce and will give background on our bioinformatics challenges.
These slides were presented at ApacheCon Europe 2012:
http://www.apachecon.eu/schedule/presentation/3/
Presentation given by US Chief Scientist, Mario Inchiosa, at the June 2013 Hadoop Summit in San Jose, CA.
ABSTRACT: Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
Brief introduction to the R language and its use in large data applications. Tools and techniques for data ingestion and analytics are touched on. Packages and support for publishing research and visualizations are described.
3. Analysis of Data (on MapReduce)
• CF Classifier
• Decision Tree
• Recommendation
• Graph
• Clustering
Friday, November 11, 11
4. Related Works
• RHIPE
• RHadoop
• hive (Hadoop InteractiVE)
Must understand R and MapReduce
5. RHive is inspired by ...
• Many analysts have been used R for a long time
• Many analysts can use SQL language
• There are already a lot of statistical functions in R
• R needs a capability to analyze big data
• Hive supports SQL-like query language (HQL)
• Hive supports MapReduce to execute HQL
R is the best solution for familiarity
Hive is the best solution for capability
6. RHive Components
• Hadoop
• store and analyze big data
• Hive
• use HQL instead of MapReduce programming
• R
• support friendly environment to analysts
7. RHive - Architecture
Execute R function objects and R objects through Hive queries
Execute Hive queries through R

rcal <- function(arg1, arg2) {
  coeff * sum(arg1, arg2)
}

SELECT R('rcal', col1, col2)
FROM tab1

RServe
(architecture diagram: HDFS data blocks processed via R Function / R Object / RUDF / RUDAF)
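The two directions in the architecture above can be sketched as a short R session. This is a hedged sketch rather than deck content: the host/port values and the tab1 schema are assumptions, and it requires a running Hive server plus Rserve on the cluster nodes.

```r
library(RHive)

# Connect to Hive (host and port are illustrative assumptions)
rhive.connect(host = "hive-server", port = 10000)

# Direction 1: execute a Hive query from R
head10 <- rhive.query("SELECT col1, col2 FROM tab1 LIMIT 10")

# Direction 2: execute an R function inside a Hive query via the R() UDF
rcal <- function(arg1, arg2) {
  coeff * sum(arg1, arg2)  # 'coeff' would also need to be assigned and exported
}
rhive.assign("rcal", rcal)   # register the function under a name
rhive.exportAll("rcal")      # push it out to the Hadoop nodes
result <- rhive.query("SELECT R('rcal', col1, col2, 0.0) FROM tab1")
```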
9. RUDF - R User-defined Functions
SELECT R('R function-name', col1, col2, ..., TYPE)
• UDF doesn't know the return type until calling the R function
• TYPE : return type
Example : R function which sums all passed columns
sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}
rhive.assign('sumCols', sumCols)
rhive.exportAll('sumCols', hadoop-clusters)
result <- rhive.query("SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab")
plot(result)
10. RUDAF - R User-defined Aggregation Functions
SELECT RA('R function-name', col1, col2, ...)
• R cannot manipulate large datasets by itself
• Supports the UDAF life cycle
• iterate, partial-merge, merge, terminate
• Return type is only a string delimited by ',' - "data1,data2,data3,..."
(diagram: data blocks → FUN → FUN.partial → partial aggregation → FUN.merge → FUN.terminate → aggregation values)
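The four lifecycle stages can be mimicked in plain R to see how partial results combine. This sketch is not RHive's actual RUDAF API; it computes a mean as (sum, count) pairs merged across two simulated partitions, with the result serialized as a ','-delimited string as the slide requires.

```r
# iterate: fold one value into a (sum, count) state
FUN <- function(state, value) c(state[1] + value, state[2] + 1)

# partial: emit the partial aggregation state
FUN.partial <- function(state) state

# merge: combine two partial states
FUN.merge <- function(a, b) a + b

# terminate: finish the aggregate and serialize it as "mean,count"
FUN.terminate <- function(state) paste(state[1] / state[2], state[2], sep = ",")

# Simulate two map-side partitions
s1 <- Reduce(FUN, c(1, 2, 3), c(0, 0))
s2 <- Reduce(FUN, c(4, 5), c(0, 0))
FUN.terminate(FUN.merge(FUN.partial(s1), FUN.partial(s2)))  # "3,5" (mean, count)
```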
11. UDTF : unfold and expand
• RUDAF only returns a string delimited by ','
• Convert a RUDAF's result to an R data.frame
unfold(string_value, type1, type2, ..., delimiter)
expand(string_value, type, delimiter)
RA('newcenter', ...) returns "num1,num2,num3" per cluster-key
SELECT unfold(tb1.x, 0.0, 0.0, 0.0, ',') AS (col1, col2, col3)
FROM (SELECT RA('newcenter', attr1, attr2, attr3, attr4) AS x
      FROM table GROUP BY cluster-key) tb1
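What unfold does can be mimicked in plain R: splitting the ','-delimited RUDAF result and typing each field yields one column per value. The input string below is an invented example.

```r
# A RUDAF returns one ','-delimited string per group, e.g. from RA('newcenter', ...)
raw <- "1.5,2.5,3.5"

# unfold(raw, 0.0, 0.0, 0.0, ',') roughly corresponds to:
fields <- as.numeric(strsplit(raw, ",", fixed = TRUE)[[1]])
centers <- as.data.frame(t(fields))
names(centers) <- c("col1", "col2", "col3")
centers  # one row: col1 = 1.5, col2 = 2.5, col3 = 3.5
```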
12. napply and sapply
rhive.napply(table-name, FUN, col1, ...)
rhive.sapply(table-name, FUN, col1, ...)
• napply : R apply function for numeric types
• sapply : R apply function for string types
Example : R function which sums all passed columns
sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}
result <- rhive.napply("tab", sumCols, col1, col2, col3, col4)
rhive.load.table(result)
13. napply
• 'napply' is similar to the R apply function
• Stores a big result to HDFS as a Hive table
rhive.napply <- function(tablename, FUN, col = NULL, ...) {
  if (is.null(col))
    cols <- ""
  else
    cols <- paste(",", col)
  for (element in c(...)) {
    cols <- paste(cols, ",", element)
  }
  exportname <- paste(tablename, "_napply", as.integer(Sys.time()), sep = "")
  rhive.assign(exportname, FUN)
  rhive.exportAll(exportname)
  tmptable <- paste(exportname, "_table", sep = "")
  rhive.query(
    paste("CREATE TABLE ", tmptable, " AS SELECT R('", exportname, "'", cols, ",0.0) FROM ", tablename, sep = ""))
  tmptable
}
14. aggregate
rhive.aggregate(table-name, hive-FUN, ..., groups)
• RHive aggregation function that aggregates data stored in HDFS using a Hive function
Example: aggregate using SUM (Hive aggregation function)
result <- rhive.aggregate("emp", "SUM", sal, groups = "deptno")
rhive.load.table(result)
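Locally, this call corresponds to the HiveQL grouping `SELECT deptno, SUM(sal) FROM emp GROUP BY deptno`, which base R's aggregate() reproduces on a toy emp table (the sample values below are made up):

```r
# Local analogue of rhive.aggregate("emp", "SUM", sal, groups = "deptno"):
# the same grouping HiveQL expresses as
#   SELECT deptno, SUM(sal) FROM emp GROUP BY deptno
# done with base R's aggregate() on a small in-memory emp data.frame.
emp <- data.frame(
  deptno = c(10, 10, 20, 20, 20),
  sal    = c(1000, 2000, 1500, 1500, 3000)
)

aggregate(sal ~ deptno, data = emp, FUN = sum)
#   deptno  sal
# 1     10 3000
# 2     20 6000
```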
15. Examples - predict flight delay
library(RHive)
rhive.connect()
- Retrieve a training set from the large dataset stored in HDFS
train <- rhive.query("SELECT dayofweek, arrdelay, distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())")
train$arrdelay <- as.numeric(train$arrdelay)
train$distance <- as.numeric(train$distance)
train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]  # native R code
model <- lm(arrdelay ~ distance + dayofweek,data=train)
- Export R object data
rhive.assign("model", model)
- Analyze big data using model calculated by R
predict_table <- rhive.napply("airlines", function(arg1, arg2, arg3) {  # Hive query + R code
  if (is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
  res <- predict.lm(model, data.frame(dayofweek = arg1, arrdelay = arg2, distance = arg3))
  return(as.numeric(res)) }, 'dayofweek', 'arrdelay', 'distance')
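The fit-locally, score-remotely pattern above can be checked end to end on synthetic data: fit lm() on a small sample, then score one row the way the napply callback would. The data below is invented for illustration; only the column names are reused from the slide:

```r
# Check the fit-locally / score-remotely pattern on synthetic data.
# The sample is made up; in RHive the scoring function would run
# inside Hive via rhive.napply, with `model` exported by rhive.assign.
set.seed(1)
train <- data.frame(
  dayofweek = factor(sample(1:7, 100, replace = TRUE), levels = 1:7),
  distance  = runif(100, 100, 2000)
)
train$arrdelay <- 5 + 0.01 * train$distance + rnorm(100)

model <- lm(arrdelay ~ distance + dayofweek, data = train)

# Score one "row" the way the napply callback does (arrdelay is
# accepted but unused, mirroring the slide's three-argument callback).
score_row <- function(dayofweek, arrdelay, distance) {
  if (is.null(dayofweek) || is.null(distance)) return(0.0)
  as.numeric(predict.lm(model,
    data.frame(dayofweek = dayofweek, distance = distance)))
}

score_row(factor(3, levels = 1:7), NA, 500)  # predicted delay for one row
```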
17. Conclusion
• RHive supports HQL, not the MapReduce-style programming model
• RHive lets analysts do everything from the R console
• RHive exchanges data between R and HDFS
• Future & Current Works
• Integrate Hadoop HDFS
• Support Transform/Map-Reduce Scripts
• Distributed Rserve
• Support more R style API
• Support machine learning algorithms (k-means, classifier, ...)
18. Cooperators
• JunHo Cho
• Seonghak Hong
• Choonghyun Ryu
• YOU!
19. How to join RHive project
• github (https://github.com/nexr/RHive)
• CRAN (http://cran.r-project.org/web/packages/RHive)
• Welcome to join RHive project
20. References
• Ricardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)
• RHIPE (http://ml.stat.purdue.edu/rhipe)
• Hive (http://hive.apache.org)
• Parallel R by Q. Ethan McCallum and Stephen Weston
23. RHIPE
• the R and Hadoop Integrated Processing Environment
• Must understand the MapReduce model
map <- function() {...}
reduce <- function() {...}
rmr <- rhmr(map, reduce, ...)
[Diagram: R submits the job via RHMR; forked R processes run the map and reduce R objects, exchanging data through Protocol Buffers, with shuffle/sort between Mapper and Reducer over HDFS]
24. RHadoop
• Manipulate Hadoop data stores and HBASE directly from R
• Write MapReduce models in R using Hadoop Streaming
• Must understand the MapReduce model
map <- function() {...}
reduce <- function() {...}
mapreduce(input,output,map,reduce,...)
[Diagram: rhbase and rhdfs manipulate HBASE and HDFS from R; rmr executes R map and reduce functions through Hadoop Streaming, storing R objects as files, with shuffle/sort between Mapper and Reducer]
25. hive(Hadoop InteractiVE)
• R extension facilitating distributed computing via the MapReduce paradigm
• Provides an interface to Hadoop, HDFS and Hadoop Streaming
• Must understand the MapReduce model
map <- function() {...}
reduce <- function() {...}
hive_stream(map,reduce,...)
[Diagram: the hive DFS helpers manipulate HDFS from R; hive_stream saves the R script locally and executes it via Hadoop Streaming, with shuffle/sort between Mapper and Reducer over HDFS]
26. segue
• Simple parallel processing with a fast and easy setup on Amazon Web Services.
• Parallel lapply function for the EMR engine using Hadoop Streaming.
• Does not support the full MapReduce model, only the Map step.
data <- list(...)
emrlapply(clusterObject, data, FUN, ...)
[Diagram: emrlapply and the awsFunctions save R objects (data + FUN) locally, upload them to Amazon S3, bootstrap R on the EMR cluster, and run mapper.R as the Hadoop Streaming mapper]
27. Ricardo
• Integrates R and Jaql (JSON Query Language)
• Requires knowing Jaql, an uncommon query language
• Not open source
Ref : Ricardo-SIGMOD10