My Talk at GCPUG-Taiwan on 2015/5/8.
You use BigQuery with SQL, but BigQuery's internals are very different from the traditional relational database systems you may be familiar with.
One way to understand how BigQuery works is to look at the cost you pay for it. Knowing how to save money while using BigQuery means, to some extent, knowing how BigQuery works.
In this session, let’s talk about practical knowledge (saving money) and exciting technology (how BigQuery works)!
MySQL performance monitoring using Statsd and Graphite (PLUK2013), by spil-engineering
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
Note: this is a placeholder for the presentation next Tuesday at the Percona Live London
Content and talk by Giovani Lanzani (GoDataDriven) at SEA Amsterdam in November 2014. Real time data driven applications using Python and pandas as backend
Covering how to migrate, back up, and archive your important data to GCS:
1. Copying/Migrating Data.
2. Object Composition
3. Durable Reduced Availability Storage
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf, by InfluxData
Did you know you can use InfluxDB to monitor your BBQ and to ensure the tastiest results? Join this meetup to learn two different approaches to using a time series database to monitor a BBQ or a smoker. Learn how Will Cooke uses Python, MQTT, Telegraf and InfluxDB 2.0 to monitor his smoker and to gain insight into temperature changes, the stall, and other important stats about his brisket. Scott Anderson will demonstrate how he uses a FireBoard wireless thermometer, Telegraf and InfluxDB 2.0 to continuously work towards the perfect smoke.
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FO..., by Rob Emanuele
Slides from the 2017 FOSS4G Workshop "Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS"
See the repository at https://github.com/lossyrob/foss4g-2017-geopyspark-workshop
codecentric AG: Using Cassandra and Clojure for Data Crunching backends, by DataStax Academy
Choosing tooling for Data Crunching and Analytics is no easy task. Understanding the way Cassandra works and treats the data can open up a lot of opportunities for optimization in your back-end. Learn how we developed a Smart Platform for complex, multidimensional analytics using the best available tools.
MongoDB natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. In this webinar, learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
Taming the Tiger: Tips and Tricks for Using Telegraf, by InfluxData
Taming the Tiger: Tips and Tricks for Using Telegraf
As part of the InfluxDays North America 2020 Virtual Experience, the Technical Services team will offer a free live InfluxDB training to the first 100 registered attendees. It will be hosted over Zoom and Slack with two main trainers, plus assistants to help participants with the coursework. The training will be recorded and made available on the InfluxDays website and the InfluxData YouTube channel.
The course provides an introduction to using Telegraf in a hands-on lab setting. Attendees will be presented with a series of lab exercises and will work through them with the assistance of our remote proctors. After taking this class, attendees will be able to:
Articulate the purposes and value of Telegraf
Understand the basics of configuring and running Telegraf
Understand how to manipulate incoming data to optimize InfluxDB schema
Visualize the insertion results using InfluxDB Cloud UI
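The last two objectives are typically exercised through Telegraf's TOML configuration. Here is a minimal sketch (an illustration, not course material; every name, token, and URL below is a placeholder assumption):

```toml
# Collect CPU metrics (input plugin)
[[inputs.cpu]]
  percpu = false

# Processors reshape data before output; here the "cpu" tag is renamed
# to "core" to fit an assumed target schema
[[processors.rename]]
  [[processors.rename.replace]]
    tag = "cpu"
    dest = "core"

# Write to InfluxDB 2.x / Cloud (placeholder org, bucket, and token)
[[outputs.influxdb_v2]]
  urls = ["https://example.cloud2.influxdata.com"]
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "example-bucket"
```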
Around the world with extensions | PostgreSQL Conference Europe 2018 | Craig ..., by Citus Data
Postgres continues to get more and more feature rich. But equally as impressive is the network of extensions that are growing around Postgres. With the rich extension APIs you can now add advanced functionality to Postgres without having to fork the codebase or wait for the main PostgreSQL release cycle. In this talk we'll cover some of the basics of what an extension is and then take a tour through a variety of Postgres extensions including:
pg_stat_statements
PostGIS
HyperLogLog and TopN
Timescale
pg_partman
Citus
Foreign data wrappers which are their own whole class
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith..., by Data Con LA
"At OpenX we not only use the tools in the big data ecosystem to solve our business problems, but also explore cutting-edge algorithms for practical use. HyperLogLog is one of the algorithms we use intensively in our internal systems. It has a really low computation cost and plugs easily into a map-reduce framework (Hadoop or Spark). Some applications worth highlighting are:
* high cardinality test
* distinct count of unique users over time
* Visualize hyperloglog for fraud detection"
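To make the idea concrete, here is a minimal, illustrative HyperLogLog in Python. This is a sketch of the algorithm only, not OpenX's implementation; the register count and hash function are arbitrary assumptions:

```python
import hashlib
import math

def _hash64(value):
    # 64-bit hash derived from SHA-1 (real systems use faster hashes)
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                  # 2^p registers; more registers, lower error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        x = _hash64(value)
        idx = x >> (64 - self.p)                      # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero count + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            return int(self.m * math.log(self.m / zeros))
        return int(raw)
```

A few kilobytes of registers estimate the distinct count of millions of items, and registers from separate map-reduce partitions merge with an element-wise max, which is what makes the algorithm fit Hadoop/Spark so well.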
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira..., by InfluxData
Giraffe is the open source React-based visualization library that powers data visualizations in the InfluxDB 2.0 UI. Giraffe can be used to display your data within your own app and is Fluxlang-supported! It uses algorithms to handle visualizing high volumes of time series data that InfluxDB can ingest and query.
Kristina Robinson, the engineering manager for the Giraffe team at InfluxData, will dive into:
The basics of using the Giraffe library including how to query your data with Flux
Specific Giraffe visualization types for dashboards (e.g. single number, table and graph)
How to incorporate visualizations in your own custom apps
Do you gather metrics from your application? Can you combine them and easily generate custom graphs out of them? Can your developers measure whatever they want at any point of your application without breaking it or making it slower?
In our next itnig friday, Víctor Martínez will show us how easy it is to roll your own Graphite installation and how to use Etsy's statsd collector to flush your metrics. You will learn what Graphite is, how all of its components work, how to get your real-time and historic metrics into Carbon (Graphite's database), and how to plot them in different ways. Víctor will show us some Graphite dashboards, alternative statsd implementations, detailed common Graphite configuration gotchas, design limitations, and how to deal with them.
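For a flavor of how simple statsd's wire protocol is, here is a tiny UDP client sketch in Python (host, port, and metric names are illustrative; 8125 is statsd's conventional UDP port):

```python
import socket

class Statsd:
    """Minimal statsd client: metrics are plain-text UDP datagrams of the
    form "<name>:<value>|<type>" (c = counter, ms = timer, g = gauge)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _format(self, name, value, kind):
        return f"{name}:{value}|{kind}"

    def incr(self, name, n=1):          # bump a counter
        self.sock.sendto(self._format(name, n, "c").encode(), self.addr)

    def timing(self, name, ms):         # record a duration in milliseconds
        self.sock.sendto(self._format(name, int(ms), "ms").encode(), self.addr)
```

statsd aggregates these datagrams in memory and flushes rollups into Carbon on each flush interval, which is where Graphite picks them up.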
The weather is everywhere and always. That makes for a lot of data. This talk will walk you through how you can use MongoDB to store and analyze worldwide weather data from the entire 20th century in a graphical application. We’ll discuss loading and indexing terabytes of data in a sharded cluster, and optimizing the schema design for interactive exploration. MongoDB also natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. You'll learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
The Weather of the Century Part 3: Visualization, by MongoDB
MongoDB natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. In this presentation, learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
This presentation will recount the story of Macys.com (and Bloomingdales.com)'s selection and migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
We'll start with a mercifully brief backgrounder on our website and our business. Then we will go over the various technologies that we considered, as well as our use case-based performance benchmarks that led to the decision to go with Cassandra.
We'll cover the various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
And, finally, we will wrap up with our "lessons learned" and a brief look at our future plans.
In this talk we will look to the future by introducing Spark as an alternative to the classic Hadoop MapReduce engine. We will describe the most important differences between the two, detail the main components that make up the Spark ecosystem, and introduce basic concepts so you can start developing simple applications on top of it.
Taboola's experience with Apache Spark (presentation @ Reversim 2014), by tsliwowicz
At Taboola we receive a constant feed of data (many billions of user events a day) and use Apache Spark together with Cassandra for both real-time data stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source project - a Hadoop-compatible computing engine that makes big data analysis drastically faster, through in-memory computing, and simpler to write, through easy APIs in Java, Scala and Python. This project was born as part of PhD work in UC Berkeley's AMPLab (part of the BDAS - pronounced "Bad Ass") and turned into an incubating Apache project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already has large production clusters of Spark on YARN.
Spark can run as a standalone cluster, on Apache Mesos with ZooKeeper, or on YARN, and can run side by side with Hadoop/Hive on the same data.
One of the biggest benefits of Spark is that the API is very simple and the same analytics code can be used for both streaming data and offline data processing.
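That claim can be illustrated with a plain-Python analogy (this is not the Spark API, just the shape of the idea): write the analytics against a generic iterable and the same function serves a finite batch or a simulated stream:

```python
def count_by_country(events):
    # the "analytics code": works on any iterable of events
    counts = {}
    for e in events:
        counts[e["country"]] = counts.get(e["country"], 0) + 1
    return counts

batch = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]

def stream():
    # a generator stands in for a live event feed
    yield from batch

# identical logic, two kinds of source
assert count_by_country(batch) == count_by_country(stream()) == {"US": 2, "DE": 1}
```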
GridSQL is commonly thought of as a replication solution along the lines of Slony and Bucardo, but the open source GridSQL project actually allows PostgreSQL queries to be parallelized across many servers, allowing performance to scale nearly linearly. In this session, we will discuss the advantages of using GridSQL for large multi-terabyte data warehouses and how to design your PostgreSQL schemas and queries to leverage it. We will dig into how GridSQL plans a query capable of spanning multiple PostgreSQL servers and executes it across those nodes. We will also delve into performance expectations and where GridSQL should be deployed.
As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed frequently you can end up in a loop-like scenario where you scale down, need to recompute the expensive part of your computation, scale back up, and then need to scale back down again.
Even if you aren’t in a serverless-like environment, preemptible or spot instances can encounter similar issues when large numbers of workers disappear, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers such as YARN and Kubernetes: everything from how to schedule jobs to the location of blocks and their impact (shuffle and otherwise).
Columnstore indexes appeared with SQL Server 2012, and many limitations have been lifted and improvements introduced with SQL Server 2014 and, soon, SQL Server 2016. The same goes for In-Memory tables starting with SQL Server 2014. In this session, discover how SQL Server 2016 addresses the need for real-time operational analytics by introducing and combining these two In-Memory technologies.
Introduces important facts and tools to help you get started with performance improvement.
Learn to monitor and analyze important metrics; then you can start digging and improving.
Includes useful Munin probes, predefined SQL queries to investigate your database's performance, and a top 5 of the most common performance problems in custom apps.
By Olivier Dony - Lead Developer & Community Manager, OpenERP
Storage and computation are getting cheaper and easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing, sensory data, etc. Hadoop and the ecosystem of projects built around it provide simple and easy-to-use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce Jobs (using Hadoop Streaming).
* Introduce PigLatin, an easy-to-use data processing language.
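Since Hadoop Streaming is on the list, here is a sketch of the classic word-count mapper and reducer written as pure Python functions (in a real job each would be a stand-alone script reading stdin; this is an illustration, not the talk's material):

```python
from itertools import groupby

def mapper(lines):
    # emit one "word<TAB>1" record per word
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so records for the same word arrive consecutively
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

With Hadoop Streaming the pair would be wired up with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in/ -output out/` (jar name and paths are placeholders).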
Speaker Profile: Mahesh Reddy is an entrepreneur chasing dreams. He works on large-scale crawling and extraction of structured data from the web. He is a graduate from IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as a Research Engineer/Tech Lead on search and advertising products.
The Tidyverse and the Future of the Monitoring Toolchain, by John Rauser
As delivered at Monitorama PDX 2017.
Thesis: the ideas in the tidyverse (not the tools themselves, but the fundamental ideas they are built upon) are going to have a profound impact on everything having to do with data manipulation, visualization, and analysis.
Complex realtime event analytics using BigQuery @Crunch Warmup, by Márton Kodok
Complex event analytics solutions require massive architecture and know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. In this presentation we will see how BigQuery achieves our ultimate goal: store everything, accessible by SQL immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, affiliate metrics.
Improve your SQL workload with observability, by OVHcloud
How do we see everything in our perimeter? Better yet, how do we make sure everyone can follow the activity of their databases? Developers are not used to watching their database performance closely; how do we share this knowledge and make it accessible? This is the challenge we set ourselves. One year later, we can share our experience. Scaling is not always about throwing resources at a design issue. What if observability were not just a buzzword, but had a real impact on production?
Similar to: Trading volume mapping R in recent environment
Securing your Kubernetes cluster: a step-by-step guide to success, by KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo..., by James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 3, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Smart TV Buyer Insights Survey 2024, by 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
DevOps and Testing slides at DASA Connect, by Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants tried to find different ways to think about quality and testing across the DevOps infinity loop.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024, by Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
PHP Frameworks: I want to break free (IPC Berlin 2024), by Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Transcript: Selling digital books in 2024: Insights from industry leaders - T..., by BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
2. Motivation of this Talk
• R is rapidly evolving in recent years.
• Great packages are appearing one after another!!!
• Needless to say, googleVis too
• Let me show how to use these for data analysis in this presentation
1. Data manipulation by dplyr
2. Visualization by rMaps
3. Libraries needed
library(data.table)
library(rMaps)
library(dplyr)
library(magrittr)
library(countrycode)
library(xts)
library(pings)
6. Fantastic collaboration: dplyr, magrittr, pings
• You can chain commands with the forward-pipe operator %>% (magrittr)
• Data manipulation like “mutate”, “group_by”, “summarize” (dplyr)
• Soundlize(?) data manipulation (pings)
8. You can write like this (dplyr + magrittr):
iris %>%
  # add new column “Width”
  mutate(Width=Sepal.Width+Petal.Width) %>%
  # group the data by species
  group_by(Species) %>%
  # calculate the mean value of the Width column
  summarize(AverageWidth=mean(Width)) %>%
  # extract only the column “AverageWidth”
  use_series(AverageWidth) %>%
  # divide the AverageWidth values by 3
  divide_by(3) %>%
  # get the maximum value of AverageWidth/3
  max
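For readers coming from Python, the same chain maps almost one-to-one onto pandas. This comparison is an addition, not part of the original slides, and the tiny DataFrame below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica"],
    "Sepal.Width": [3.5, 3.0, 3.3, 3.0],
    "Petal.Width": [0.2, 0.2, 2.5, 1.8],
})

result = (
    df.assign(Width=lambda d: d["Sepal.Width"] + d["Petal.Width"])  # mutate
      .groupby("Species")["Width"]                                  # group_by
      .mean()                                                       # summarize
      .div(3)                                                       # divide_by(3)
      .max()                                                        # max
)
```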
9. You can write like this (dplyr + magrittr + pings):
pings(iris %>%
  # add new column “Width”
  mutate(Width=Sepal.Width+Petal.Width) %>%
  # group the data by species
  group_by(Species) %>%
  # calculate the mean value of the Width column
  summarize(AverageWidth=mean(Width)) %>%
  # extract only the column “AverageWidth”
  use_series(AverageWidth) %>%
  # divide the AverageWidth values by 3
  divide_by(3) %>%
  # get the maximum value of AverageWidth/3
  max)
12. Data for analysis…
• I prepared trading volume data
• It is already formatted
• I used the “data.table” package, which gives us fast data loading (the fread function)
13. Data for analysis…
> str(x)
Classes ‘data.table’ and 'data.frame': 21245 obs. of 5 variables:
 $ Date        : Date, format: "2012-11-01" "2012-11-01" ...
 $ User_Country: chr "AR" "AT" "AU" "BD" ...
 $ Amount      : num 775582 931593 565871 566 7986 ...
 $ ISO3C       : chr "ARG" "AUT" "AUS" "BGD" ...
 $ Date2       : num 15645 15645 15645 15645 15645 ...
 - attr(*, ".internal.selfref")=<externalptr>
> head(x)
        Date User_Country     Amount ISO3C Date2
1 2012-11-01           AR 775581.543   ARG 15645
2 2012-11-01           AT 931592.986   AUT 15645
3 2012-11-01           AU 565870.994   AUS 15645
4 2012-11-01           BD    565.863   BGD 15645
5 2012-11-01           BE   7985.860   BEL 15645
6 2012-11-01           BG  56863.958   BGR 15645
14. Data processing by dplyr
> xs <- x %>%
+ mutate(YearMonth=as.yearmon(Date)) %>%
+ group_by(YearMonth, ISO3C) %>%
+ summarize(Amount=floor(sum(Amount)/10^4))
> head(xs)
YearMonth ISO3C Amount
1 11 2012 ALB 0
2 11 2012 ARE 7
3 11 2012 ARG 647
4 11 2012 AUS 2153
5 11 2012 AUT 503
6 11 2012 BEL 41
16. Data processing by dplyr
• Define some variables we use later
> min.date <- xs %>%
+ use_series(YearMonth) %>%
+ as.Date %>% min %>% as.character
> max.counter <- xs %>%
+ use_series(Counter) %>% max
> min.date
[1] "2012-11-01"
> max.counter
[1] 12
17. Visualize with rMaps
• Install and load rMaps
library(devtools)
install_github('ramnathv/rCharts@dev')
install_github('ramnathv/rMaps')
library(rMaps)
18. Visualize with rMaps
• Easy to visualize yearly data
• But we have monthly data
• We need to customize the template HTML and JavaScript code
• The code gets a little bit long…