This document summarizes a talk on winning with big data and data science. It offers nine tips for working with big data: choosing the right tools, compressing data, splitting data, working with samples, using statistics, copying from others, avoiding chart typologies, using color carefully, and telling a story. It also presents a success story in which call data and social networks were used to analyze why telecom customers leave, finding a 700% increase in churn when a cancellation occurs within a subscriber's call network. The document promotes data science as a lucrative field and provides contact information for the presenter.
Winning with Big Data: Secrets of the Successful Data Scientist (Dataspora)
A new class of professionals, called data scientists, has emerged to address the Big Data revolution. In this talk, I discuss nine skills for munging, modeling, and visualizing Big Data. Then I present a case study of using these skills: the analysis of billions of call records to predict customer churn at a North American telecom.
http://en.oreilly.com/datascience/public/schedule/detail/15316
The world is experiencing an Industrial Revolution of Data. In any given minute the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. And increasingly this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights.
Presented at the June 2010 gathering of the Bay Area's Business Intelligence Special Interest Group.
Slide 32, The Big Data Stack (diagram): layers include Data, Dedicated RDBMS, Big Data storage, Analytics (R, SPSS, SAS, SAP), Insights, Data Products (content filters, recommendation engines), and Actions.
Slide 33, Thanks! Questions? Michael Driscoll, med@dataspora.com, @dataspora on Twitter, http://www.dataspora.com/blog. SDForum BI SIG, June 15, 2010.
Editor's Notes
I'm Mike Driscoll, founder of Dataspora LLC, a boutique analytics firm based in San Francisco. Before coming out to the Bay Area, I worked on the Human Genome Project and earned a doctorate in computational biology. Today I'm going to talk about Big Data, data science, and some tips for the data scientist.
If you had to put your finger on the beginning of the information age, it might be the creation of the first telegraph in 1792, in France, by a pair of brothers: the first time that man-made information traveled at the speed of light over long distances. Today, cars, cash registers, subway turnstiles, gene chips, TiVos, and cell phones are streaming billions of data points. We live in a world exploding with data. In any given minute, databases somewhere are tracking mouse clicks on web sites, point-of-sale purchases, rider swipes through subway turnstiles, physician prescriptions, digital video recorder rewinds, and the location of every GPS-enabled car and phone on the planet. Prof. Joe Hellerstein of Berkeley has dubbed it "The Industrial Revolution of Data," where machines, not people, are the dominant producers of data. Ben Lorica of O'Reilly Media has said Big Data is "data that you have to think about" when storing, analyzing, or otherwise grappling with it. But capturing data isn't enough; we need tools to make sense of it. At Facebook, they call their data analysts "data scientists." I like this term, because it captures the point of collecting this data: testing hypotheses about the world. And to test hypotheses using Big Data, we need statistics.
In this talk I'm also going to be talking about tools for medium data, because these translate well into the Big Data space.
I define data science as applying tools to data to answer questions. It sits at the intersection of these tools, and it is a growing field, because data is getting bigger and our tools are getting better. (Suffice to say, the questions we ask have been around since time immemorial.) Another word for questions is hypotheses. I'll talk about tools for munging the data that answers them.
Do you really need Hadoop for that job? Think twice about it. Can you do everything on one machine? Escalate only as necessary; don't solve problems that don't yet exist. At the same time, optimize for scalability, not performance. Cleverness is usually punished in the long run.
Compressing gives you a 6-8x bump in network and disk I/O, right out of the gate. This example also illustrates another principle: avoid hitting disk at all costs. The same holds if you're working in the cloud.
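To make the point above concrete, here is a minimal Python sketch (the file name and record layout are made up): the data is written and read gzip-compressed, so no uncompressed copy ever touches disk.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.tsv.gz")

# Write a sample event log, compressed on the fly; nothing uncompressed hits disk.
with gzip.open(path, "wt") as f:
    for i in range(1000):
        f.write(f"user{i % 10}\tclick\t{i}\n")

# Stream it back directly; decompression happens in memory, line by line.
clicks = 0
with gzip.open(path, "rt") as f:
    for line in f:
        user, event, ts = line.rstrip("\n").split("\t")
        if event == "click":
            clicks += 1

print(clicks)  # 1000
```

Most Unix tools play the same trick (zcat, zgrep), so a compressed pipeline rarely costs you any convenience.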
This is the essence of parallelism, and in fact of big data: the key is to find some independent dimension on which to split your data. Otherwise everything sits together in a monolithic file system, database, or data store, which often spells disaster.
* Even if your data isn't in a database, split it up the old-fashioned way: one file per hour, day, or month, depending on its size. These often form natural samples to work from.
* Learn and understand how to partition, shard, or otherwise distribute your data in a database.
* Parallel load is your friend: several databases have parallel load features; Hadoop has distcp.
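The per-period splitting described above can be sketched in a few lines of Python. The record fields and file naming here are hypothetical; the pattern of one output file per value of the partitioning dimension is the point:

```python
import os
import tempfile

# Hypothetical call records: (caller, day, duration_seconds).
records = [
    ("alice", "2010-06-01", 120),
    ("bob",   "2010-06-01", 45),
    ("alice", "2010-06-02", 300),
    ("carol", "2010-06-03", 90),
]

outdir = tempfile.mkdtemp()
handles = {}
for caller, day, dur in records:
    # Partition on an independent dimension (here: day), one file per value.
    if day not in handles:
        handles[day] = open(os.path.join(outdir, f"calls-{day}.tsv"), "w")
    handles[day].write(f"{caller}\t{dur}\n")
for h in handles.values():
    h.close()

print(sorted(os.listdir(outdir)))
# ['calls-2010-06-01.tsv', 'calls-2010-06-02.tsv', 'calls-2010-06-03.tsv']
```

Each per-day file can then be loaded, sampled, or processed by a separate worker with no coordination.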
Do you really want to be moving GBs and TBs around? Sometimes you want to visualize and work on the data locally, so sample!
* Reservoir sampling is a fixed-memory algorithm for achieving a defined-size sample.
* The slide illustrates a basic 1% uniform sample as a Perl one-liner.
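The slide's example was a Perl one-liner; an equivalent fixed-memory reservoir sample can be sketched in Python like this (the stream and sample size here are arbitrary):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Return a uniform random sample of k items from a stream using O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Item i survives into the sample with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))  # 100
```

Because the memory footprint is fixed at k items, this works no matter how large the stream is, and it needs only a single pass.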
When we compare two real-valued measures, they will almost always differ. The critical question is: how confident are we in the difference? Is it significant? There's also something to be said for differences that are significant but so small in magnitude as to be meaningless. (I once sat through a heart-drug presentation that showed a significant but inconsequential difference versus aspirin. The price differential was not inconsequential, however.)
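The notes don't name a particular test, but a simple permutation test is one way to ask whether an observed difference between two measures could plausibly be chance; the numbers below are invented for illustration:

```python
import random

# Two groups of measurements (made-up values for illustration).
a = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4, 5.1, 5.3]
b = [4.6, 4.8, 4.5, 4.9, 4.7, 4.6, 4.8, 4.7]

observed = sum(a) / len(a) - sum(b) / len(b)

# Shuffle the pooled data repeatedly; count how often a random split
# produces a difference at least as extreme as the one observed.
rng = random.Random(0)
pooled = a + b
extreme = 0
trials = 10_000
for _ in range(trials):
    rng.shuffle(pooled)
    diff = sum(pooled[:len(a)]) / len(a) - sum(pooled[len(a):]) / len(b)
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / trials
print(p_value < 0.05)
```

A small p-value says the difference is unlikely under chance; whether it is large enough to matter is, as the note says, a separate question.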
Don't reinvent the wheel; steal someone else's wheels of 1s and 0s. Statistics is hard, so go ahead and use someone else's stuff. It's there. Just today I cribbed code from StackOverflow to make a heatmap in R. That's what's great about R: 2,000 statistical libraries written by professors.
Not machines, people.
Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a new visualization package, formally released in 2009 and developed by Professor Hadley Wickham. It is based on a different perspective on developing graphics, and has its own set of functions and parameters.
Most telcos lose 1-2% of their customers every month. It's 7x more expensive to acquire a new customer than to retain an existing one.
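A quick sanity check on those numbers (an illustration, not from the talk): a "small" monthly churn rate compounds into a large annual loss, which is what makes churn prediction worth the effort.

```python
def annual_churn(monthly_churn, months=12):
    """Fraction of customers lost over a year at a constant monthly churn rate."""
    return 1 - (1 - monthly_churn) ** months

low = annual_churn(0.01)   # 1% monthly -> roughly 11% of the base gone in a year
high = annual_churn(0.02)  # 2% monthly -> roughly 21% gone in a year
```

At a 7x acquisition-versus-retention cost, every customer a churn model helps retain is worth roughly seven averted acquisitions.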
Not machines, people.
This illustrates what we said earlier: statistics matters. We needed to rule this out. (If anything, the correlation runs opposite to what we expected.)
"A Survey of R Graphics" – presented to the LA R Users Group, June 18, 2009. Today I'm going to go through a survey of data visualization functions and packages in R. In particular, I'll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I'll also discuss some methods for visualizing large data sets. I'll end with an overview of Rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E. Driscoll, http://www.dataspora.com, mike@dataspora.com.
Windowing functions in Greenplum, a distributed database built on a modified Postgres.
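As a rough illustration of what such a windowed query computes, here is a Python emulation (names and schema hypothetical, not the talk's query) of `LAG(ts) OVER (PARTITION BY caller ORDER BY ts)`: for each call, attach the timestamp of the same caller's previous call.

```python
from itertools import groupby
from operator import itemgetter

def lag_per_caller(calls):
    """Emulate SQL's LAG() OVER (PARTITION BY caller ORDER BY ts).

    Input: (caller, ts) pairs. Output: (caller, ts, prev_ts) triples,
    where prev_ts is None for each caller's first call.
    """
    calls = sorted(calls, key=itemgetter(0, 1))  # sort by (caller, ts)
    out = []
    for caller, group in groupby(calls, key=itemgetter(0)):
        prev_ts = None
        for _, ts in group:
            out.append((caller, ts, prev_ts))
            prev_ts = ts
    return out

rows = lag_per_caller([("alice", 1), ("alice", 3), ("bob", 2)])
```

In a distributed database the same logic runs in parallel across shards, since each caller's partition is independent -- the same splitting principle from earlier.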
The stack is loosely coupled: the right tool for the right job. No one firm can do it all.
- There aren't -- not yet, at least -- out-of-the-box solutions for getting through this: data scientists occupy the middle.
Big Data is disrupting this entire stack:
-- at the bottom, new DB firms like Aster
-- in the middle, the same revo-
You know who sits at the top of that stack? We do. That's why storytelling is such an important skill.