Driscoll bi sig_15_jun2010

This document discusses winning with big data and data science. It provides 9 tips for working with big data, including choosing the right tools, compressing data, splitting data, working with samples, using statistics, copying from others, avoiding chart typologies, using color carefully, and telling a story. It also presents a success story about using call data and social networks to analyze why telecom customers leave, finding a 700% increase in churn when a cancellation occurs in a call network. The document promotes data science as being a lucrative field and provides contact information for the presenter.

WINNING WITH BIG DATA Secrets of the Successful Data Scientist SDForum BI SIG June 15, 2010 Michael Driscoll @dataspora

A new class of professionals, called data scientists, has emerged to address the Big Data revolution. In this talk, I discuss nine skills for munging, modeling, and visualizing Big Data. Then I present a case study of using these skills: the analysis of billions of call records to predict customer churn at a North American telecom. http://en.oreilly.com/datascience/public/schedule/detail/15316

Winning With Big Data: Secrets of the Successful Data Scientist

Dataspora

The world is experiencing an Industrial Revolution of Data. In any given minute the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. And increasingly this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights. Presented at the June 2010 gathering of the Bay Area's Business Intelligence Special Interest Group.

1. WINNING WITH BIG DATA Secrets of the Successful Data Scientist SDForum BI SIG June 15, 2010 Michael Driscoll @dataspora

2. WHY DATA MATTERS NOW

3. THE INDUSTRIAL AGE OF DATA

4. WHAT IS BIG DATA? Data that is distributed.

5. WHAT IS DATA SCIENCE?

6. WHY DATA SCIENCE IS SEXY

7. “The sexy job in the next ten years will be statisticians…” - Hal Varian

9. data model 1000 bytes 2 bytes

10. 9 WAYS TO WIN WITH DATA

11. 1. CHOOSE THE RIGHT TOOL You don’t need a chainsaw to cut butter.

12. 2. COMPRESS EVERYTHING mysqldump -u myuser -pmypass sourceDB | gzip | ssh mike@dataspora.com "cat - | gunzip | mysql -u myuser -pmypass targetDB" The world is IO-bound.

13. 3. SPLIT UP YOUR DATA Split, apply, combine.

14. 4. WORK WITH SAMPLES perl -ne "print if (rand() < 0.01)" data.csv > sample.csv Big Data is heavy, samples are light.

15. 5. USE STATISTICS

16. 6. COPY FROM OTHERS git clone git://github.com/kevinweil/hadoop-lzo Use open source.

17. 7. ESCHEW CHART TYPOLOGIES Charts are compositions, not containers.

18. 8. COLOR WITH CARE Color can enhance or insult.

19. 9. TELL A STORY People are listening.

20. ONE SUCCESS STORY

21. WHY DO TELCO CUSTOMERS LEAVE? Sign up Leave Goal: “less churn.”

22. DATA: BILLIONS OF CALLS … and millions of callers.

23. DOES CALL QUALITY MATTER? … a difference, but not significant.

24. WHAT ABOUT SOCIAL NETWORKS? Hmmm...

25. BUILD THE CALL GRAPH … but is it predictive?

26. EVOLUTION OF A CALL GRAPH April

27. EVOLUTION OF A CALL GRAPH May

28. EVOLUTION OF A CALL GRAPH June

29. EVOLUTION OF A CALL GRAPH July

30. 700% INCREASE IN CHURN when a cancellation occurs in a call network.

31. FINAL THOUGHTS

32. THE BIG DATA STACK Actions Data Products (Content Filters, Rec Engines) Analytics (R, SPSS, SAS, SAP) Insights Big Data Dedicated RDBMS Data

33. THANKS! QUESTIONS? Michael Driscoll med@dataspora.com @dataspora on Twitter http://www.dataspora.com/blog SDForum BI SIG June 15, 2010

Editor's Notes

I’m Mike Driscoll, founder of Dataspora LLC; we’re a boutique analytics firm based in San Francisco. Before coming out to the Bay Area, I worked on the Human Genome Project & got a doctorate in Computational Biology. Today I’m going to talk about Big Data, Data Science, and some tips for the Data Scientist.
If you had to put your finger on the beginning of the information age, it might be the creation of the first telegraph in 1792, in France, by a pair of brothers. It was the first time that man-made information traveled at the speed of light, over long distances.
Cars, cash registers, subway turnstiles, gene chips, TiVos, and cell phones are streaming billions of data points. We live in a world exploding with data. In any given minute, databases somewhere are tracking mouse clicks on web sites, point-of-sale purchases, rider swipes through subway turnstiles, physician prescriptions, digital video recorder rewinds, and the location of every GPS-enabled car and phone on the planet. Prof. Joe Hellerstein of Berkeley has dubbed it “The Industrial Revolution of Data” – where machines, not people, are the dominant producers of data.
So the world is streaming billions of data points per minute. This is Big Data – capital B, capital D. Ben Lorica of O’Reilly Media has said Big Data is “data that you have to think about” when storing, analyzing or otherwise grappling with it.
But capturing data isn’t enough. We need tools to make sense of it. At Facebook, they call their data analysts ‘data scientists’. I like this term, because it captures the point of collecting this data: testing hypotheses about the world. And to test hypotheses using Big Data, we need statistics.
In this talk I’m also going to be talking about tools for medium data, because these translate well into the Big Data space.
I’m defining data science as: applying tools to data to answer questions. It is at the intersection of these tools. And it is a growing field, because data is getting bigger, and our tools are getting better. (Suffice to say, the questions we ask have been around since time immemorial: who…) Another word for questions is hypotheses. I’ll talk about tools for munging; the answers to these questions are…
Do you really need Hadoop for that job? Think twice about it. Can you do everything on one machine? Escalate only as necessary… don’t solve problems that don’t yet exist. At the same time, optimize for scalability, not performance. Cleverness is usually punished in the long run.
Compressing gives you a 6-8x bump immediately in network and disk IO, out of the gate. This example also illustrates another piece: avoid hitting disk at all costs. If you’re working on the cloud…
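To make the compression point concrete, here is a minimal Python sketch of the gzip and gunzip steps from the slide's pipeline, using only the standard library. The sample rows are hypothetical, and the ratio it prints depends entirely on how repetitive the data is; it is not meant to reproduce the 6-8x figure quoted in the note.

```python
import gzip

# Hypothetical CSV-like payload: repetitive text, as dumps and logs tend to be.
rows = [f"{i},2010-06-15,user{i % 100},call_completed\n" for i in range(10_000)]
raw = "".join(rows).encode("utf-8")

packed = gzip.compress(raw)         # the `gzip` stage of the pipeline
restored = gzip.decompress(packed)  # the `gunzip` stage on the receiving side

ratio = len(raw) / len(packed)
print(f"raw={len(raw)} bytes, gzipped={len(packed)} bytes, ratio={ratio:.1f}x")
```

Everything stays in memory here, which mirrors the note's other point: the data is compressed and decompressed in flight rather than written to disk in between.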
This is the essence of parallelism, and in fact, of big data: the key is to find some independent dimension on which to split your data. Otherwise everything sits together, in a monolithic file system, database, or data store -- which often spells disaster.
* Even if your data isn’t in a database, split it up the old-fashioned way – one file per hour, day, or month, depending on its size – these often form natural samples to work from.
* Learn & understand how to partition, shard, or otherwise distribute your data in a database.
* Parallel load is your friend: several databases have parallel load features; Hadoop has distcp.
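A small sketch of "split, apply, combine" may help here. It assumes hypothetical call records keyed by day (the record layout and the function names `split`, `summarise`, and `combine` are invented for illustration) and shows why an independent dimension matters: each group can be summarised without looking at any other.

```python
from collections import defaultdict

# Hypothetical call records: (day, caller_id, duration_seconds).
calls = [
    ("2010-06-14", "a", 60),
    ("2010-06-14", "b", 120),
    ("2010-06-15", "a", 30),
    ("2010-06-15", "c", 90),
    ("2010-06-15", "b", 150),
]

def split(records, key_index=0):
    """Split: group records by an independent dimension (here, the day)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)
    return groups

def summarise(group):
    """Apply: summarise one group on its own; no other group is needed."""
    durations = [duration for _, _, duration in group]
    return {"calls": len(durations), "total_seconds": sum(durations)}

def combine(partials):
    """Combine: merge the per-group summaries into one result."""
    return {
        "calls": sum(p["calls"] for p in partials.values()),
        "total_seconds": sum(p["total_seconds"] for p in partials.values()),
    }

per_day = {day: summarise(group) for day, group in split(calls).items()}
overall = combine(per_day)
print(per_day)
print(overall)
```

Because `summarise` only ever sees one day's records, the per-day work could run on separate files, processes, or machines, which is the property the note is pointing at.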
Do you want to be moving GBs and TBs around? Sometimes you want to visualize and work on the data locally… so sample!
* Reservoir sampling is a fixed-memory algorithm for achieving a defined-size sample.
* The above illustrates how to get a basic 1% uniform sample in a Perl one-liner.
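The Perl one-liner keeps each line with probability 0.01, so the sample size varies with the input. Reservoir sampling, mentioned in the note, fixes the sample size instead. A minimal sketch of the classic single-pass version (often called Algorithm R) follows; the function name, the `rng` parameter, and the generated rows are assumptions made for the example.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item  # keep the new item with probability k / (i + 1)
    return reservoir

# Hypothetical stream; in practice this could be an open file iterated line by line.
lines = (f"row-{n}" for n in range(100_000))
sample = reservoir_sample(lines, k=1000, rng=random.Random(42))
print(len(sample))
```

Memory use is bounded by `k` no matter how long the stream is, which is what makes it suitable for data that will not fit on one machine.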
When we compare two real-valued measures, they will almost always be different. The critical question is: How confident are we in the difference? Is it significant? There’s also something to be said for differences that are significant but so small in magnitude as to be meaningless. (I once sat through a heart drug presentation, which showed a significant but inconsequential difference versus Aspirin. The price differential was not inconsequential, however.)
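One way to put numbers on "is it significant?" is a two-sided z-test for the difference between two proportions. The sketch below is an illustration under stated assumptions, not the analysis used in the talk: the counts are made up, and the normal approximation it relies on needs reasonably large samples.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions (normal approximation)."""
    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return rate_a - rate_b, z, p_value

# Hypothetical counts: the same kind of comparison at two very different sample sizes.
small = two_proportion_z_test(12, 1_000, 15, 1_000)
large = two_proportion_z_test(15_000, 1_000_000, 15_500, 1_000_000)

for label, (diff, z, p) in (("small", small), ("large", large)):
    print(f"{label}: diff={diff:+.4f} z={z:+.2f} p={p:.4f}")
```

With these invented counts, the small study shows the bigger gap but cannot distinguish it from noise at the 5% level, while the large study finds a gap that is significant yet only a twentieth of a percentage point wide. Both halves of the note's warning show up: check significance, then check whether the magnitude matters.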
Don’t reinvent the wheel; steal someone else’s wheels of 1s and 0s. Statistics is hard – so go ahead & use someone else’s stuff. Go ahead. It’s there. Just today I cribbed code from StackOverflow to make a heatmap in R. That’s what’s great about R: 2000 statistical libraries written by professors.
Not machines, people.
Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a new visualization package formally released in 2009, developed by Professor Hadley Wickham. It is based on a different perspective on developing graphics, and has its own set of functions and parameters.
Most telcos lose 1-2% of their customers every month. It’s 7x more expensive to acquire a customer than to retain one.
Not machines, people.
This illustrates what we said earlier: statistics matters. We needed to rule this out. (If anything, the correlation runs opposite to what we expected.)
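With call quality ruled out, the slides turn to the call graph and ask whether a cancellation in a customer's calling circle predicts that customer's own churn. The sketch below shows one simple way such a comparison could be set up. It is not the method used in the case study: the customer names, the call pairs, the two-period split, and the helper functions are all hypothetical, and the toy numbers do not reproduce the 700% figure from the slides.

```python
from collections import defaultdict

def build_call_graph(call_pairs):
    """Undirected graph: customer -> set of customers they exchanged calls with."""
    graph = defaultdict(set)
    for a, b in call_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def churn_by_exposure(graph, cancelled_first, cancelled_later):
    """Compare later churn for customers with and without a neighbour who cancelled earlier."""
    exposed = set()
    for customer in cancelled_first:
        exposed |= graph.get(customer, set())
    exposed -= cancelled_first                          # early cancellers are not at risk
    unexposed = set(graph) - exposed - cancelled_first

    def rate(group):
        return len(group & cancelled_later) / len(group) if group else 0.0

    return rate(exposed), rate(unexposed)

# Hypothetical toy data.
calls = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
         ("eve", "fay"), ("fay", "gus"), ("hal", "ivy")]
graph = build_call_graph(calls)
exposed_rate, unexposed_rate = churn_by_exposure(
    graph, cancelled_first={"bob"}, cancelled_later={"cat", "gus"})
lift = exposed_rate / unexposed_rate
print(f"exposed={exposed_rate:.2f} unexposed={unexposed_rate:.2f} lift={lift:.1f}x")
```

On real data the two rates would also need a significance check before the lift is trusted, for the same reason the call-quality difference was tested.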
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009. Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I’ll also discuss some methods for visualizing large data sets. I’ll end with an overview of Rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E Driscoll, http://www.dataspora.com, mike@dataspora.com.
Windowing functions in Greenplum, which is a modified Postgres distributed database.
The stack is loosely coupled: right tool for the right job. No one firm can do it all.
- There aren’t – not yet at least – out-of-the-box solutions for getting through this: the data scientists occupy the middle.
Big Data is disrupting this entire stack:
-- at the bottom, new DB firms like Aster
-- in the middle, the same revo-
You know who sits on the top of that stack? We do. That’s why storytelling is such an important skill.
