Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on GitHub: http://bit.ly/pawdata
1. Tapping the Data Deluge with R
Finding and using supplemental data
to add context to your analysis
by Jeffrey Breen
Principal, Think Big Academy
Code & Data on github: http://bit.ly/pawdata
email: jeffrey.breen@thinkbiganalytics.com
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen
2. Data data everywhere!
This may be what you picture the data deluge looking like if you work for The Economist.
But those of us who wrangle data for a living know that it’s usually not so prosaic or buttoned-down, proper or quaint.
3. Real data hits us in the face...
Real data can hit you in the face.
Yet we keep coming back for more.
4. ...and then there’s Big Data.
And I’m not even going to talk about Big Data tonight. (For a change!)
5. Finding the right data makes all the difference
Tonight we’re going to look at a few different places to find the data sets that can make a difference, and a few techniques to access them so you can incorporate them into your analysis.
6. The two types of data
Data you have
Data you don’t have... yet
Perhaps you’ve heard the joke: There are two kinds of people: people who think there are two kinds of people and people who don’t.
I like to think that there are two kinds of data.
7. The two types of data
• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata,...)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages
• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs
8. Reading CSV files is easy
$ head -5 data/mpg-3-13-2012.csv | cut -c 1-60
"Model Yr","Mfr Name","Division","Carline","Verify Mfr Cd","
2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
data = read.csv('data/mpg-3-13-2012.csv')
View(data)
see R/01-read.csv-mpg.R
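One thing worth knowing when reading a file like this: read.csv() rewrites header text such as "Model Yr" into syntactic R names by default. The sketch below is only an illustration. It writes a three-column cut-down of the header to a temporary file (the temporary file and the variable names are assumptions, not part of the talk's code) and compares the default behaviour with check.names = FALSE.

```r
# Write a tiny CSV using three of the header fields shown above.
# The temp file is a stand-in for data/mpg-3-13-2012.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"Model Yr","Mfr Name","Carline"',
  '2012,"aston martin","V12 Vantage"',
  '2012,"aston martin","V8 Vantage"'
), csv_path)

# Default: header text is converted to syntactic names (spaces become dots)
mpg_default <- read.csv(csv_path, stringsAsFactors = FALSE)
print(names(mpg_default))

# check.names = FALSE keeps the header text exactly as written in the file
mpg_raw <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)
print(names(mpg_raw))

# Columns with non-syntactic names need backticks or [[ ]] to access
print(mpg_raw[["Model Yr"]])
```

Keeping the original names is handy when the output goes back to people who know the file by its headers; the converted names are easier to type in analysis code.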
9. But so is reading Excel files directly
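library(XLConnect)
# open the existing workbook (create=FALSE: do not create the file if it is missing)
wb = loadWorkbook("data/mpg.xlsx", create=FALSE)
# read one sheet, selected by name, into a data.frame
data = readWorksheet(wb, sheet='3-7-2012')
see R/02-XLConnect-mpg.R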
11. Relational databases
library(RMySQL)
con = dbConnect(MySQL(), user="root", dbname="test")
data = dbGetQuery(con, "select * from airport")
dbDisconnect(con)
View(data)
  airport_code airport_name                       location                    state_code country_name time_zone_code
1 ATL          WILLIAM B. HARTSFIELD              ATLANTA,GEORGIA             GA         USA          EST
2 BOS          LOGAN INTERNATIONAL                BOSTON,MASSACHUSETTS        MA         USA          EST
3 BWI          BALTIMORE/WASHINGTON INTERNATIONAL BALTIMORE,MARYLAND          MD         USA          EST
4 DEN          STAPLETON INTERNATIONAL            DENVER,COLORADO             CO         USA          MST
5 DFW          DALLAS/FORT WORTH INTERNATIONAL    DALLAS/FT. WORTH,TEXAS      TX         USA          CST
6 OAK          METROPOLITAN OAKLAND INTERNATIONAL OAKLAND,CALIFORNIA          CA         USA          PST
7 PHL          PHILADELPHIA INTERNATIONAL         PHILADELPHIA PA/WILM'TON,DE PA         USA          EST
8 PIT          GREATER PITTSBURGH                 PITTSBURGH,PENNSYLVANIA     PA         USA          EST
9 SFO          SAN FRANCISCO INTERNATIONAL        SAN FRANCISCO,CALIFORNIA    CA         USA          PST
see R/04-RMySQL-airport.R
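The same DBI calls work against other back ends, and it is usually better to let the database filter rows than to pull a whole table into R. The sketch below is an assumption-laden illustration, not the talk's code: it swaps in an in-memory SQLite database (via the RSQLite package) so that no MySQL server is needed, loads three rows copied from the table above, and binds a parameter rather than pasting it into the SQL. The function name airports_in_state is hypothetical.

```r
# Sketch only: SQLite stands in for MySQL so the example needs no server.
# Returns NULL if the DBI or RSQLite packages are not installed.
airports_in_state <- function(state) {
  if (!requireNamespace("DBI", quietly = TRUE) ||
      !requireNamespace("RSQLite", quietly = TRUE)) {
    return(NULL)
  }
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)  # always release the connection

  # Three rows copied from the airport table shown above
  airport <- data.frame(
    airport_code = c("BOS", "PHL", "PIT"),
    state_code   = c("MA", "PA", "PA"),
    stringsAsFactors = FALSE
  )
  DBI::dbWriteTable(con, "airport", airport)

  # The database does the filtering; the value is bound, not pasted into SQL
  DBI::dbGetQuery(
    con,
    "select airport_code from airport where state_code = ? order by airport_code",
    params = list(state)
  )
}

pa <- airports_in_state("PA")
print(pa)
```

With a real MySQL connection only the dbConnect() line would change; the placeholder syntax for bound parameters can differ between drivers, so check the driver's documentation.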
12. Non-relational databases too
> library(rhbase)
> hb.init(serialize='raw')
> x = hb.get(tablename='tweets', rows='221325531868692480')
> str(x)
List of 1
$ :List of 3
..$ : chr "221325531868692480"
..$ : chr [1:10] "created:" "favorited:" "id:" "replyToSID:" ...
..$ :List of 10
.. ..$ : chr "2012-07-06 19:31:33"
.. ..$ : chr "FALSE"
.. ..$ : chr "221325531868692480"
.. ..$ : chr "NA"
.. ..$ : chr "NA"
.. ..$ : chr "NA"
.. ..$ : chr "arnicas"
.. ..$ : chr "<a href="http://www.tweetdeck.com"
rel="nofollow">TweetDeck</a>"
.. ..$ : chr "RT @bycoffe: From @DrewLinzer, an #Rstats function for querying
the HuffPost Pollster API. http://t.co/fXnG32JX cc @thewhyaxis"
.. ..$ : chr "FALSE"
13. weird emails from the boss
con = textConnection('
# Hi:
#
# Please invite these paid volunteers to the spontaneous rally at 3PM today:
#
Name Department "Hourly Rate" email
Alice Operations 32 alice@wonderland.org
Billy Logistics 5 billy.pilgrim@slaugterhouse5.com
Winston Records 20 winston.smith@truth.gov.oc
#
#Thanks,
#Your Boss
#! ! ! ! !
')
data = read.table(con, header=T, comment.char='#')
close.connection(con)
View(data)
  Name    Department Hourly.Rate email
1 Alice   Operations 32          alice@wonderland.org
2 Billy   Logistics  5           billy.pilgrim@slaugterhouse5.com
3 Winston Records    20          winston.smith@truth.gov.oc
see R/05-textConnection-email.R
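The trick here is that comment.char='#' makes read.table() skip every line the boss "commented out", and the quoted "Hourly Rate" header becomes the column Hourly.Rate. As a minimal variation (an alternative, not the talk's code), read.table() also accepts the pasted text directly through its text argument, which saves opening and closing a connection. The e-mail body and addresses below are made up for illustration.

```r
# Pasted e-mail body; lines starting with '#' are treated as comments
email_body <- '
# Hi:
#
# Please invite these people to the meeting:
#
Name Department "Hourly Rate" email
Alice Operations 32 alice@example.org
Billy Logistics 5 billy@example.com
#
#Thanks,
#Your Boss
'

# text= reads from the string itself, so no textConnection() is needed
staff <- read.table(text = email_body, header = TRUE,
                    comment.char = "#", stringsAsFactors = FALSE)
print(staff)

# The quoted header "Hourly Rate" is turned into a syntactic column name
print(names(staff))
```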
14. > data()
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock
Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in
Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US
Superior Court
USPersonalExpenditure Personal Expenditure Data
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines,
1937-1960
airquality New York Air Quality Measurements
[...]
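The listing printed by data() can also be used programmatically, which helps when hunting for a bundled data set on a given topic. The following is a small sketch using only base R; the variable names are illustrative.

```r
# data() returns an object whose $results matrix has Item and Title columns
catalog <- data(package = "datasets")$results[, c("Item", "Title")]
print(head(catalog, 3))

# Find data sets whose title contains "air" (case-insensitive substring match)
air <- catalog[grepl("air", catalog[, "Title"], ignore.case = TRUE), , drop = FALSE]
print(air[, "Item"])

# Data sets in the datasets package can be used directly by name
str(airquality)
```

The same call with a different package name lists the data sets that any installed package ships with.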
15. > library(zipcode)
> data(zipcode)
> str(zipcode)
'data.frame': 44336 obs. of 5 variables:
$ zip : chr "00210" "00211" "00212" "00213" ...
$ city : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
$ state : chr "NH" "NH" "NH" "NH" ...
$ latitude : num 43 43 43 43 43 ...
$ longitude: num -71 -71 -71 -71 -71 ...
> subset(zipcode, city=='Boston' & state=='MA')
zip city state latitude longitude
664 02101 Boston MA 42.37057 -71.02696
665 02102 Boston MA 42.33895 -70.91963
666 02103 Boston MA 42.33895 -70.91963
667 02104 Boston MA 42.33895 -70.91963
668 02105 Boston MA 42.33895 -70.91963
669 02106 Boston MA 42.35432 -71.07345
670 02107 Boston MA 42.33895 -70.91963
671 02108 Boston MA 42.35790 -71.06408
672 02109 Boston MA 42.36148 -71.05417
673 02110 Boston MA 42.35653 -71.05365
674 02111 Boston MA 42.34984 -71.06101
675 02112 Boston MA 42.33895 -70.91963
676 02113 Boston MA 42.36503 -71.05636
677 02114 Boston MA 42.36179 -71.06774
678 02115 Boston MA 42.34308 -71.09268
679 02116 Boston MA 42.34962 -71.07372
680 02117 Boston MA 42.33895 -70.91963
681 02118 Boston MA 42.33872 -71.07276
682 02119 Boston MA 42.32451 -71.08455
683 02120 Boston MA 42.33210 -71.09651
684 02121 Boston MA 42.30745 -71.08127
685 02122 Boston MA 42.29630 -71.05454
686 02123 Boston MA 42.33895 -70.91963
687 02124 Boston MA 42.28713 -71.07156
688 02125 Boston MA 42.31685 -71.05811
690 02127 Boston MA 42.33499 -71.04562
691 02128 Boston MA 42.37830 -71.02550
696 02133 Boston MA 42.33895 -70.91963
726 02163 Boston MA 42.36795 -71.12056
757 02196 Boston MA 42.33895 -70.91963
[...]
17. The two types of data
• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata,...)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages
• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs
Code & Data on github: http://bit.ly/pawdata
20. Many base functions take URLs
library(ggplot2)
# paste0() splits the long URL across lines without embedding newlines in it
url = paste0('http://ichart.finance.yahoo.com/table.csv?',
             's=YHOO&d=8&e=28&f=2012&g=d&a=3&b=12&c=1996&',
             'ignore=.csv')
data = read.csv(url)
ggplot(data) + geom_point(aes(x=as.Date(Date),
    y=Close), size = 1) + scale_y_log10() + theme_bw()
see R/06-read.csv-url-yahoo.R
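The same idea can be tried without a network connection, since read.csv() also accepts file:// URLs. A minimal sketch, assuming a temporary CSV whose three rows are made-up placeholders rather than real quotes:

```r
# Write a tiny CSV to a temporary file; the values are made-up placeholders
csv.file = tempfile(fileext = '.csv')
writeLines(c('Date,Close',
             '2012-09-26,10.0',
             '2012-09-27,10.5',
             '2012-09-28,11.0'), csv.file)

# Build a file:// URL for it (Windows drive paths need an extra '/')
path = normalizePath(csv.file, winslash = '/')
url = paste0('file://', if (startsWith(path, '/')) '' else '/', path)

# read.csv() opens the URL just as it would a local file name
data = read.csv(url)

# Dates arrive as text, so convert them before plotting
data$Date = as.Date(data$Date)
str(data)
```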
22. download.file() if URLs aren’t supported
library(XLConnect)
url = paste0("http://www.fueleconomy.gov/feg/EPAGreenGuide/xls/",
             "all_alpha_12.xls")
local.xls.file = 'data/all_alpha_12.xls'
# mode='wb' keeps the binary .xls file intact on Windows
download.file(url, local.xls.file, mode='wb')
wb = loadWorkbook(local.xls.file, create=FALSE)
data = readWorksheet(wb, sheet='all_alpha_12')
View(data)
see R/07-download.file-XLConnect-green.R
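When a script is re-run often, it can help to download only when the local copy is missing. A minimal sketch of that pattern; fetch.once() is a hypothetical helper (not part of base R or XLConnect), and the demo copies a temporary file through a file:// URL so it runs offline:

```r
# Hypothetical helper: download 'url' to 'dest' unless 'dest' already exists
fetch.once = function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    # mode = 'wb' avoids corrupting binary files on Windows
    download.file(url, dest, mode = 'wb', quiet = TRUE)
  }
  invisible(dest)
}

# Offline demo: a made-up two-row file stands in for the remote source
src = tempfile(fileext = '.csv')
writeLines(c('model,mpg', 'A,30', 'B,42'), src)
path = normalizePath(src, winslash = '/')
src.url = paste0('file://', if (startsWith(path, '/')) '' else '/', path)

dest = file.path(tempdir(), 'cache', 'demo.csv')
fetch.once(src.url, dest)   # first call copies the file
fetch.once(src.url, dest)   # second call finds the local copy and skips
cars = read.csv(dest)
```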
23. image credit: http://groovynoms.com/2011/07/25/beer-of-the-week-2/
Now, I don’t mean to oversell this next one, but if you’ve spent as much time as I have finding -- and trying to deal with --
interesting data sets on web pages, you might agree that this next function alone is worth the price of admission.
24. not even HTML tables are safe
library(XML)
url = 'http://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'
state.capitals.df = readHTMLTable(url, which=2)
   State       Abr. Date of statehood Capital     Capital since Land area (mi²) Most populous city?
1  Alabama     AL   1819              Montgomery  1846          155.4           No
2  Alaska      AK   1959              Juneau      1906          2716.7          No
3  Arizona     AZ   1912              Phoenix     1889          474.9           Yes
4  Arkansas    AR   1836              Little Rock 1821          116.2           Yes
5  California  CA   1850              Sacramento  1854          97.2            No
6  Colorado    CO   1876              Denver      1867          153.4           Yes
7  Connecticut CT   1788              Hartford    1875          17.3            No
8  Delaware    DE   1787              Dover       1777          22.4            No
9  Florida     FL   1845              Tallahassee 1824          95.7            No
10 Georgia     GA   1788              Atlanta     1868          131.7           Yes
see R/08-readHTMLTable.R
As you’d expect from a package called “XML”, it parses well-formed XML files.
But I didn’t expect it would do such a good job with HTML.
And I certainly didn’t expect to find a function as handy as readHTMLTable()!
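readHTMLTable() also works on an already-parsed document, which makes it easy to try offline. A minimal sketch, assuming a small inline HTML string; its two rows are copied from the output above and the shortened "Since" column name is an illustrative choice:

```r
library(XML)

# A small inline HTML table; the two rows are taken from the output above
html = '<html><body><table>
<tr><th>State</th><th>Capital</th><th>Since</th></tr>
<tr><td>Alabama</td><td>Montgomery</td><td>1846</td></tr>
<tr><td>Alaska</td><td>Juneau</td><td>1906</td></tr>
</table></body></html>'

# Parse the string instead of fetching a URL
doc = htmlParse(html, asText = TRUE)

# which = 1 returns the first table as a data frame rather than a list of tables
capitals = readHTMLTable(doc, which = 1, header = TRUE,
                         stringsAsFactors = FALSE)

# Cells arrive as text, so numeric columns need converting
capitals$Since = as.integer(capitals$Since)
str(capitals)
```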
27. ..and couldn’t be easier to access.
library(rdatamarket)
oil.prod = dmseries("http://data.is/nyFeP9")
plot(oil.prod)
see R/09-rdatamarket.R
DataMarket includes its own URL shortener -- like bit.ly but just for their data.
Long or short, just give dmseries() the URL, and it will download the data set for you.
28. Make a withdrawal from the World Bank
> library(WDI)
> WDIsearch('population, total')
indicator name
"SP.POP.TOTL" "Population, total"
> WDIsearch('fertility .*total')
indicator name
"SP.DYN.TFRT.IN" "Fertility rate, total (births per woman)"
> WDIsearch('life expectancy .*birth.*total')
indicator name
"SP.DYN.LE00.IN" "Life expectancy at birth, total (years)"
> WDIsearch('GDP per capita .*constant')
indicator name
[1,] "NY.GDP.PCAP.KD" "GDP per capita (constant 2000 US$)"
[2,] "NY.GDP.PCAP.KN" "GDP per capita (constant LCU)"
see R/10-WDI.R
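The search strings above are regular expressions, which is why '.*' can bridge the words in between. A rough base-R sketch of that matching, assuming a tiny hand-built table holding only the five indicators shown above; search.names() is a hypothetical stand-in, not the WDI package's implementation:

```r
# The five indicators from the searches above, as a small lookup table
indicators = data.frame(
  indicator = c('SP.POP.TOTL', 'SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN',
                'NY.GDP.PCAP.KD', 'NY.GDP.PCAP.KN'),
  name = c('Population, total',
           'Fertility rate, total (births per woman)',
           'Life expectancy at birth, total (years)',
           'GDP per capita (constant 2000 US$)',
           'GDP per capita (constant LCU)'),
  stringsAsFactors = FALSE)

# Hypothetical helper: case-insensitive regex match against the name column
search.names = function(pattern) {
  indicators[grepl(pattern, indicators$name, ignore.case = TRUE), ]
}

search.names('fertility .*total')          # one match
search.names('GDP per capita .*constant')  # two matches
```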
29. Swedish Accent Not Included
data = WDI(country=c('BR', 'CN', 'GB', 'JP', 'IN', 'SE', 'US'),
           indicator=c('SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN', 'SP.POP.TOTL',
                       'NY.GDP.PCAP.KD'),
           start=1900, end=2010)
library(googleVis)
g = gvisMotionChart(data, idvar='country', timevar='year')
plot(g)
see R/10-WDI.R
30. quantmod: the king of symbols
• getSymbols() downloads time series data from
the source specified by the “src” parameter:
– yahoo = Yahoo! Finance
– google = Google Finance
– FRED = St. Louis Fed’s Federal Reserve Economic Data
– oanda = OANDA Forex Trading & Exchange Rates
– csv
– MySQL
– RData
31. Hello, FRED
55,000 economic time series from 45 sources:
• Automatic Data Processing, Inc.
• Banca d'Italia
• Banco de Mexico
• Bank of Japan
• Bankrate, Inc.
• Board of Governors of the Federal Reserve System
• BofA Merrill Lynch
• British Bankers' Association
• Central Bank of the Republic of Turkey
• Chicago Board Options Exchange
• CredAbility Nonprofit Credit Counseling & Education
• Deutsche Bundesbank
• Dow Jones & Company
• Eurostat
• Federal Financial Institutions Examination Council
• Federal Housing Finance Agency
• Federal Reserve Bank of Chicago
• Federal Reserve Bank of Kansas City
• Federal Reserve Bank of Philadelphia
• Federal Reserve Bank of St. Louis
• Freddie Mac
• Haver Analytics
• Institute for Supply Management
• International Monetary Fund
• London Bullion Market Association
• National Association of Realtors
• National Bureau of Economic Research
• Organisation for Economic Co-operation and Development
• Reserve Bank of Australia
• Standard and Poor's
• Swiss National Bank
• The White House: Council of Economic Advisors
• The White House: Office of Management and Budget
• Thomson Reuters/University of Michigan
• U.S. Congress: Congressional Budget Office
• U.S. Department of Commerce: Bureau of Economic Analysis
• U.S. Department of Commerce: Census Bureau
• U.S. Department of Energy: Energy Information Administration
• U.S. Department of Housing and Urban Development
• U.S. Department of Labor: Bureau of Labor Statistics
• U.S. Department of Labor: Employment and Training Administration
• U.S. Department of the Treasury: Financial Management Service
• U.S. Department of Transportation: Federal Highway Administration
• Wilshire Associates Incorporated
• World Bank
32. BLS Jobless data (FRED) + S&P (Yahoo!)
library(quantmod)
initial.claims = getSymbols('ICSA', src='FRED', auto.assign=FALSE)
sp500 = getSymbols('^GSPC', src='yahoo', auto.assign=FALSE)
# Convert quotes to weekly and fetch Cl() closing price
sp500.weekly = Cl(to.weekly(sp500))
see R/11-quantmod.R
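The weekly conversion can be tried offline on a toy series. A minimal sketch, assuming ten weekdays of made-up values (1 through 10) in place of downloaded quotes:

```r
library(quantmod)   # also loads xts

# Ten weekdays of made-up daily values, Monday 2012-01-02 to Friday 2012-01-13
days = seq(as.Date('2012-01-02'), as.Date('2012-01-13'), by = 'day')
days = days[!format(days, '%u') %in% c('6', '7')]   # drop Saturday and Sunday
daily = xts(1:10, order.by = days)

# to.weekly() collapses each week into Open/High/Low/Close columns
weekly = to.weekly(daily)

# Cl() picks out the Close column: the last observation of each week
weekly.close = Cl(weekly)
weekly.close
```

Two series reduced to weekly closes this way share a comparable weekly index, which is what makes it possible to line up jobless claims against the S&P 500.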
33. Resources
• Expanded code snippets and all data for this talk
– http://bit.ly/pawdata
• R Data Import/Export manual
– http://cran.r-project.org/doc/manuals/R-data.html
• CRAN: Comprehensive R Archive Network
– package lists: http://cran.r-project.org/web/packages/
– Featured: XLConnect, foreign, RMySQL, XML, quantmod, rdatamarket, WDI
– Database: RODBC, DBI, RJDBC, ROracle, RPostgreSQL, RSQLite, RMongo, RCassandra
– Data sets: zipcode, agridat, GANPAdata
– Data access: crn, rgbif, RISmed, govdat, myepisodes, msProstate, corpora
• rhbase from the RHadoop project
– https://github.com/RevolutionAnalytics/RHadoop
34. When I first said that R is my “Swiss Army
Knife” for data, you might have pictured this:
36. Thank you!
by Jeffrey Breen
Principal, Think Big Academy
Code & Data on github: http://bit.ly/pawdata
email: jeffrey.breen@thinkbiganalytics.com
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen