Challenges and opportunities in big data


Caption Transcript from “Challenges and Opportunities in Big Data”

On Thursday, March 29, 2012, from 2-3:45 pm ET, federal government science heads from OSTP, NSF, NIH, DOE, DOD, DARPA and USGS outlined how their agencies are engaged in Big Data research. The event took place at the American Association for the Advancement of Science in Washington, D.C.

An archive of this webcast will be available on within two days of this event.

(I have no ownership of this information nor can I attest to its accuracy since it was transcribed and may have room for error. This information was extracted from immediately after the session concluded)

Event ID: 1921909
Event Started: 3/29/2012 1:47:56 PM ET

Please stand by for real time captions.

  1. 1. I welcome all of you. From the looks of the crowd, big data is a big deal. We think it will be bigger. The Obama administration is announcing new investments in federal agencies, in research and development related to big data. That is, the enormous volume of data, and the variety of those data. I am joined by a number of administrative [Indiscernible]. I will introduce them in a moment. Following their brief presentation [Indiscernible] we will have a few minutes for questions. We are happy to have as a moderator for that panel Steve Lohr of the New York Times. We are grateful to him. Before I turn things over to my colleagues I would like to say a few words. Why data and then nonprofit organizations may get involved further in this domain. These data are being [Indiscernible]. Computers running large-scale programs.

Recently, the President's Council on science and technology concluded in a report for the president that the federal government is under-investing in analyzing, sharing and -- what this data. What matters is our ability to arrive at new insights, to recognize relationships, to make accurate predictions. Our ability to move from data to knowledge to action.
As we look at the action in the White House, following the release of the [Indiscernible] report, it became clear for a national big data initiative. The advantages, a cross-government focus on big data. One, the domain of big data will be important from an economic point of view. It is a creation of new IT products. How to use big data to make better decisions. Second, it is critical to accelerating discovery in many domains in science and engineering. Such as in astronomy. Containing hundreds of millions of celestial objects. Working with big data helps with major national challenges: security, health, environment and education. Some hospitals are using big data. Startups are beginning with online courses. These kinds of applications are dependent on making the most of information we and most of the world are generating every day. We will take the lead on big data that the government can play an important role. Using big data approaches to make progress on key national challenges. Government challenges as well. Last year, we had an interagency committee on big data. It has already identified concrete agencies that [Indiscernible] can take. To advance the state of the art in big data. To solve big problems in science, health, security, environment and more. We will make more additional investments. I want to challenge industries to join with the administration to make the most of the extraordinary opportunities created by big data. This is not something the government can do by itself.
  2. 2. We need what the president calls an all-hands-on-deck approach. To have this domain realized. Some companies are already sponsoring big data. Universities are beginning to create new courses. Entire new courses of study. Organizations like Data Without Borders providing programs and data collection and analysis.

It is my pleasure to introduce our speakers. I do want to recognize one person that you will see later, Con ONeill, our deputy director.

Let me now turn to Subra Suresh.

Thank you John. Today science gathers data from science, but from theoretical calculation. At NSF it happens across all fields. At NSF we recognize that data is motivating a profound transformation in culture and conduct in science research. NSF is a proud leader in supporting the fundamental science and infrastructure underlying and enabling the big data revolution. Back in the 1980s, we supported the first high-performance supercomputing centers. Now we are supporting the next generation. Blue Waters at the University of Illinois, one of the most powerful supercomputers in the world, capable of quadrillions of calculations per second, just opened to research teams two weeks ago. We announce seven new efforts. NSF and NIH have joined together in collaboration in the critical area of core techniques and technologies. We will have new knowledge from large data sets. It will evaluate new algorithms. Technology and tools to improve data collection and management. In addition to this cross-agency solicitation, I am delighted to announce a $10 million Expeditions in Computing award to researchers at the University of California, Berkeley.
The team will undertake a complete rethinking of data analysis, integrating algorithms, machines, and people to develop new ways to turn data into knowledge and insight.

Another $1.2 million award brings together statisticians and biologists to develop network models and automatic scalable algorithms and tools to tell us more about protein structures and biological pathways. NSF has aggressively embarked on focused cross-function efforts to address the challenges and seize the opportunities that big data enables in science efforts.

We are calling on members of the science community to use data citation to increase opportunities for the use and analysis of data sets. This is NSF's commitment to sustainability of data. To continue NSF's role, for the best and brightest scientists and engineers we [Indiscernible] a 21 track.

We will announce an opportunity for students, a $2 million research training group award. This is for undergraduates. To use graphic and accurate [Indiscernible].

This involves many NSF directorates. To teach and [Indiscernible] learning environments. The data-driven [Indiscernible] are bold and cross-disciplinary, and cross-governmental. We hope that this work we do today will lay the groundwork for new enterprises. And to fortify the foundations of US competitiveness. Thank you so much.

[ Applause ]
  3. 3. We just described to you this joint effort with NSF and NIH for bold ideas in the management of big data. There may have been a time when biologists were small [Indiscernible] with laboratories cranking out modest volumes of information. Because that is all we had. With the tools available. We now have arrived in the big time. We are capable of producing big data sets. And the need for analysis, we are thrilled to join our colleagues to tackle these issues. I want to say thanks to Karen Remington, helping us -- how we can make the most of this announcement. Today's advances are driving our kind of science. As I come out of the field of genomics, it will not surprise you that I will use that as an example [Indiscernible]. Will [Indiscernible]. It just went up on our websites a few minutes ago. A new collaboration. To try to deal with some of the magnitude of data being produced. We have now formed a [Indiscernible] with Amazon. It is up on the cloud. This is a project that aims to sequence 1000 genomes, but 17 of those are in hand. I can tell you, it has been a challenge for all of the users who want access to this information. That was the point of doing the project. To have that access. This is 200 terabytes. That is 16 million file cabinets' worth, or 30,000 DVDs. Having this up in the cloud will provide substantial advantages to users. We are delighted to have [Indiscernible] a couple of years. We are happy to announce this collaboration with Amazon. This will be a launching pad for many kinds of data opportunities. That will include data that will be coming forward. In large quantities. Especially in cancer research. The cancer genome is [Indiscernible] breaking 22,000 gene that genomes. What are the driving mutations? Anyone working in cancer wants access to that data. It is important to us to get the data in control. For tools for sharing and managing data. There are other things to mention, national [Indiscernible].
To support through the common port. One is your --. To have medical information shared. The most daunting and most exciting of our [Indiscernible], we need a dynamic inventory of our science resources to provide information. We recognize that that data day increase the technology -- we are pleased that we have highlighted this and brought it together. To figure out how we can work, and accept the challenge, embrace the challenge of big data. We look forward to that continued collaboration. We are delighted to have a chance to be part of this this afternoon. Thank you very much.

[ Applause ]

Good afternoon. The USGS collects data on water, earthquakes, geology, climate, biology, and the earth's landscape. So much big data that we are in danger of drowning. Today I would like to tell you what this innovative center is doing to help scientists come together in our beautiful center in Fort Collins, Colorado. To gather meaning from that large amount of data. Bringing uncommon bedfellows of scientists. Bringing together economists, biologists -- in government and outside of government to tackle the kinds of problems that are the headlines of today. By using existing data and rich sources -- to employ new technology to share and integrate existing data in order to solve those problems. To make progress in science. It is a proposal-driven process through peer review. The projects are selected from the John Wesley Powell center. There is a process open to select new projects for next year. It will close on April 30. We will also announce the proposals for next year. I will announce today. To give you an example of the ongoing projects now, and some of our successes. You will have an idea what the center does.

One problem is the uncertain climate future. We understand that climate models [Indiscernible]. The model is only as good as it can predict the past.
If they can only predict the past within a couple of degrees, because our ability to reconstruct the climate situation is only good to a couple of degrees, we cannot trust that prediction into the future.
  4. 4. We came together at the center, and used a variety of data, to put together a precise reconstruction of the [Indiscernible] thermal optimum ever produced. It will be the standard for all models. So in the future, we can predict future climate better than before. Another ongoing example is water resources. This is a subject of current headlines. People want to know how much water is used -- is it making a dent in my local water supply?

This is something the center is undertaking. A proposal -- if it is selected, I should add, for this round, the proposals are being supported by the Geological Survey, but supported by the National Science Foundation. Understanding and managing resilience of global change. Modeling species response to environmental change. Mercury cycling across North America. Fibrous [Indiscernible] in the US for human health. Modeling of earthquakes and magnitude. Thank you very much. We are very proud to be a part of this initiative.

Good afternoon. I would like to start off by thanking you for inviting us. Today the Department of Defense is [Indiscernible]. We are placing a big bet on big data. We change the game [Indiscernible]. To be the first to demonstrate in use [Indiscernible] secure peer to peer. With computer speed, precision and human [Indiscernible]. This will help our analysts to make sense of the huge volume of data that our military [Indiscernible] collects. They will also support multiple missions. We see an opportunity to maneuver and understand the environment. They will not have to make decisions by themselves. They will know when they can and cannot call upon the human. There is a revolution emerging on using mass data. [Indiscernible]. This has a potential to bring together sensing, perception and decision support. Since the invention of integrated circuits -- in 1959 with a single [Indiscernible] transition. To processors that are embedded in cell phones. No technology has greater impact or scale.
Information technology is at the core of defense. We funded [Indiscernible] in the early days, that are used globally. The department's continued investment in 3-D electronics, computing -- will ensure to extend this legacy. How we employ that capacity and the capability to use data that is being produced. Everyone has it [Indiscernible]. It is remarkable [Indiscernible] transition from the early concepts a few years ago to concepts [Indiscernible]. That dynamic reasoning, to learn from experience with little training and understanding those tools that recognize trends. Adapt to the real world. Without relying on human intervention. This must be within a secure framework, put the trust in the data. And the human trust in the system will be maintained with the system to communicate very natural [Indiscernible] and allow users to collaborate and reason with the data. In 1950, Alan Turing proposed this concept. Information on these [Indiscernible] limitation is available at a new website being launched. We are looking for a generation of new ideas. [Indiscernible]. The --

When we lived in caves -- [Indiscernible] lined up a tree to get [Indiscernible] to get a better look. Later it was observation balloons. In recent decades there was [Indiscernible]. From paper to hard drives to [Video and audio cutting in and out] that data collected is often imperfect, incomplete and [Video and audio cutting in and out].

The Atlantic Ocean is 350 million cubic kilometers in volume. It is 100 billion billion gallons of water. If each gallon of water represented a [Indiscernible] the Atlantic Ocean could store all of the data generated by the water. [Video and audio cutting in and out]. We need new fundamental approaches to big data. That match the needs of users, adaptable to changing missions and to perform on a timescale that match [Indiscernible]. We announced XDATA. It is a $25 million program a year.
With that program, we seek the equivalent of radar and overhead imagery for big data. To provide the DOD [Video and audio cutting in and out]
  5. 5. One of our roles within the science community, and one that we are proud of, is supporting construction of major [Indiscernible] over laboratories. These facilities include large-scale x-ray light sources. And some of the world's fastest supercomputers. More than 26,000 researchers across the nation, from universities to government laboratories, make use of these facilities each year. In fact, an interesting [Indiscernible] substantially your best. Single experiments conducted can produce terabytes of data per day. Estimated terabytes per second. It has to reject [Indiscernible] one of 1000 pieces of data each nanosecond. The standard output, and then intercomparison project of terabits today constitutes the fastest collection of data facilities. The climate community expects that [Indiscernible] to be hundred exabytes by 200 -- to store, share and analyze information. We have been a supporter of innovative research, analysis of extreme data. Storage and visualization technology. One is the height storage system. The fast bit data [Indiscernible] used by a major Internet search engine. It is the winner of 2008 [Video and audio cutting in and out]. To aid the nation site in the analysis and [Indiscernible]. It will develop new and existing technology for big data. We will partner with other teams to ensure that the best up-to-date technology is used throughout our program. This was based on peer to peer [Indiscernible]. Leading mathematicians and program experts from seven universities. It will have a cross range of fields, to probe and mine their data. I am pleased to say that several members are here today. I would like to thank Ari and Rob and their team. For their leadership in this area. Again we are grateful to [Video and audio cutting in and out]. Thank you.

[ Applause ]

We have a few moments for questions. If you would like to direct your question to a specific speaker, please do so.
If you do not, I will probably choose.

I am Jeff. When did you first realize that government was not doing enough on big data? Was it before that [Indiscernible] report? Why are not NASA [Video and audio cutting in and out]

Many individuals had the thought that the federal government was not coming together. It really came together when the [Indiscernible] report came out. [Video and audio cutting in and out] NASA and NOAA are not up here [Video and audio cutting in and out].

This is Bob [Indiscernible] from CNRI. I am sure you about that international [Indiscernible]. Science is not something we do alone in this country. I was wondering how you see the development of a collaborative environment being developed [Video and audio cutting in and out]

Would you like to answer?

We think about competition and collaboration all of the time with all of our entities. This is something where it is a [Indiscernible] thinking. [Video and audio cutting in and out] we have been engaging in with many of our partners. In Antarctica we have 15 countries involved. It is the same with the Arctic Circle. More recently we have been engaging other countries not just [Indiscernible]. We have biodiversity. For our programs, we have strong collaborations. The [Video and audio cutting in and out]. There is big data [Video and audio cutting in and out]
  6. 6. On the earthquake model [Video and audio cutting in and out] it was collaborative [Video and audio cutting in and out] for an example the lodge Sobotka -- large sum Entre.

What the laboratory actually does [Video and audio cutting in and out]. Every year or so, they have a model comparison. Between the various models around the world. Things like LAC, and the data from close all over the world. It is a huge network today.

When you have public access to data [Video and audio cutting in and out] at the earliest possible moment. It is critical to make that information available. Other projects [Video and audio cutting in and out] have been doing, also involved multiple countries. It is critical for data sharing. For the average investigator to have access to it.

[Indiscernible-low volume]

We do understand that. The challenge of doing that was pointed out forcefully in that report.

I think it is hot.

The fact that you guys are going to introduce [Indiscernible-low volume] there will be jobs. For many scientists. What are you predicting [Indiscernible-low volume]. With scientists working with data. Who are looking for work to be able to work in these areas, not just the new generation but those in the area now looking for work.

[Video and audio cutting in and out]

This is where the excitement is going to be. [Laughter]. There is a vast quality of [Indiscernible] career path. There are [Indiscernible]. [Video and audio cutting in and out]. We are determined to provide those training pathways to increase skilled individuals. And also to [Video and audio cutting in and out]. To have programs for individuals that need those skills in mid-career.

Our shortage is not high-performance computers, but rather high-performance people. We have training to handle more applications, to help industries get used to this [Video and audio cutting in and out].

Privacy issues with that. [Video and audio cutting in and out].
There is a consumer bill of rights. Could you expand on that? On the new challenges in the new thinking, making sure we get the benefits and minimize the risks.

We know that privacy dealing with data -- they are thinking about the privacy [Indiscernible]. We talk about this in the Council groups. I think you can expect [Video and audio cutting in and out]. Our capacity to keep up with the privacy [Indiscernible].

[Video and audio cutting in and out]

To our panel experts [Video and audio cutting in and out]

[ Applause ]
  7. 7. We have a wonderful panel of experts for you. We have experts from industry and academia. We are fortunate to have Daphne Koller, from the University [Indiscernible]. Who is an expert in machine learning and applied [Indiscernible] with the big data and application.

And Alex from [Indiscernible]. Please join me in welcoming our expert panel.

Thank you for coming. Astronomy is a strong one. This is proof. Please describe it and briefly -- what is learned. What is your take on the lessons learned that can be applied in general to scientific discovery?

We also call it the cosmic genome project. To map out the whole northern sky. It should be available for everyone on the planet. It seems like an incredibly large amount of data. We went through this process, working with Jim Gray at Microsoft. We realized that this is much bigger than [Indiscernible]. All of the information is available at our fingertips. We can make this data available for everyone. It was a wonderful experience to try out new [Indiscernible]. How to create the new data. How the astronomy community adapts to new programming language. I see the same patterns emerging with genomics. What the data sets and the scientists can do [Video and audio cutting in and out]. They can create -- all of that data about the whole sky.

What did you find out?

One discovery, one thing we did not think we could measure, was the [Indiscernible] of the early universe. We could figure out what the big bang would look like.

There is a visual imprint?

It is like taking a big drum and putting sand over it, we have the same picture of the universe.

Let's go with Daphne Koller from Stanford. You engaged in a [Indiscernible] project. About putting out advanced online programs in college courses for computer science. There has been a big debate on online education. That is a field lacking in data. Briefly describe the education process.
What is the potential of applications, if you will, for education?

We should talk about big education. Where do we get the right training for 21st-century jobs? Education is a great equalizer, but it has suffered from scarcity and affordability in the United States and around the world. What we have started at Stanford is offering large online classes throughout the world. It will equalize society and provide opportunities for people that would not have access to high-quality education. For the view that had [Indiscernible]. That set of the courses, online, with assessment, an accomplishment at the end, provides us with information on how humans learn. When you track the data from hundreds of thousands of people engaging in material, answering questions, as they interact with each other, that is a store of data we have not had in education. In the studies, the typical size is 20 or 50 people. Here you can actually study human learning when you have 100,000 people. And figure out what makes people learn and what doesn't. This kind of big data is a surprising and new opportunity for us to understand how people learn and how to learn better.

In some way this is that technology will nudge. That ultimate -- you are just starting but this.
  8. 8. There was a paper from Bloom. People were trained using a tutor. It was two standard deviations above the norm [Video and audio cutting in and out]. You cannot afford an individual tutor for everyone. If you can get the online environment to have the same personalization as a human tutoring setting. To pattern recognition -- get us to those goals that Bloom outlined.

James was a co-author of a basic study done by the McKinsey Global Institute. It had a lot of attention on many fronts. One thing that you said, we need 149,000 more people that have deep analytical skills. Give us a flavor -- it is a big gap. Give us a flavor of the numbers and where they came from. What you see unfolding.

It is very exciting talking about big data. [Video and audio cutting in and out]. Workforce issue is one of those. One thing that was striking in our research, this came from looking at a combination of what the companies are looking for -- their requirements. The challenges they talk about. The thing they put at the top of the list was the skill challenge. They can work their way through technical issues, but the skill challenge is an issue. -- People who can manipulate that data. When we look -- a combination of skill sets in the workforce today. When we projected forward to the next five years [Video and audio cutting in and out] that gap for those skills was at least hundred and 50 -- 159,000 workers. That takes into account everyone being trained. The next big challenge is the group -- the data-savvy managers. The decision makers. In the big data world, how you manage things changes dramatically. I know for an example, one company that takes advantage of big data in a huge way. The CEO actually [Video and audio cutting in and out] when I make hiring decisions [Video and audio cutting in and out] how do you run experiments? How do you get insights?

[Video and audio cutting in and out]. It was at least 1 1/2 million gap in who was trained.
I am involved with the US at Berkeley, getting enormous pressure from companies to create data management courses. How do you educate a new generation? These are the more technical -- the things in big data there are a lot of -- how do you connect structured databases with sources of data that give you unstructured data. How do you connect location devices that are [Indiscernible]. There is a lot of data integration that is required. That is different from your typical database manager. That group, that gap was 300 to 400,000. This resonates with the challenges companies are having with filling these roles. This is a surprising thing.

Let's give Lucila a chance for the policy am public purchase a patient site that goes along with privacy issues especially with the health industry. [Video and audio cutting in and out] why don't you tell us a little bit about privacy.

Imagine every visit, where you collect the data. And try to extract what worked and what did not work. [Video and audio cutting in and out]. Imagine all doctors -- all of the data available to you as well as [Indiscernible]. As well as location, and environmental data and so forth. It is important for us to have this data so the analysts can have a solution [Indiscernible]. It is important that privacy be preserved. When you go to the doctor, you want your data and privacy preserved. I believe people that -- are nice tiered -- teary at

We would know why one in 80 kids in America are at risk or diagnosed with autism. What are the factors? What can we learn when we assemble the data correctly.

So much of this -- it is the technical side. Beyond that, there is the public acceptability. You mention blood donation [Indiscernible] that kid who saved another with a rare type. Nancy's life was saved because of XYZ. As we [Indiscernible].
  9. 9. We established laws and rules for credit cards -- if you do the right thing it is a $50 limit. On the social engineering side, no matter what you do technically [Video and audio cutting in and out]. We have heard this about electronic health records. These are issues. Are you guys working on this on the social engineering side?

They say the surgeon gets the [Indiscernible]. We in informatics do not get any [Indiscernible]. We are behind the scenes. We are trying to discover new things, so the next generation will not suffer from the same issues. We are developing a consent management system. In which we do exactly that. We ask the patient, do you want to donate your data, what kind of data and who would you like to donate to? At one point you may change your mind, and that should be available to you as a choice. People do not realize how much data they already donate.

You talk about getting over the subconscious [Indiscernible] at giving data.

For quite a long time, the astronomy community has been quicker at adopting [Indiscernible] doing the science. The only thing we could do was observe the skies. We could not change the stars and galaxies. We just could interpret that data. There are people who are doing lab experiments -- they have several degrees of freedom. Astronomers learn about we saw something in the sky. We did not have a hypothesis that this was a collapsing nova. We just saw the [Indiscernible] and made more observations. We try to figure out what is going on. So the community was very quick to accept existing data sets. We had no subconscious your ears to work on.

Their response, is collaboration is great that -- or it.

It is clear the same revolution in the biological, life science. The early science [Indiscernible] was separate from what we have seen now. It is very similar. There are larger simulations. Some are under [Indiscernible]. When we analyze this data it is the same.
It is important that these computations are done at the supercomputer centers. To turn the simulation into [Indiscernible] that everyone can play with.

Daphne, I do not want to leave you out. You published a paper last year. [Video and audio cutting in and out]

The idea was to let the data speak for themselves [Video and audio cutting in and out] rather than coming in with any particular perspective. Most of pathology looks at the structure and the shape of the breast cancer itself. The surrounding tissue is more [Indiscernible] than the cancer cells themselves. This is where the benefits of large data can come to bear. That data cannot speak and tell us [Indiscernible].

Part of the assumption, we assume that cell. -- sale. Big data is -- this progression from data to knowledge is a path -- to have one believe. Your title was --

Let's talk about the micro level. When we look across the different companies and industries, or when you look at companies within the same industry, there is a wide divergence. For an example, you have companies that think they have made a lot of progress. They have gone from analyzing data. Maybe they have analyzed it every month. They use it to figure out what to do next month. The companies at the other end are analyzing all of the data every day -- in real time. And using that data to think about what they present to you next. Either on the
website or in the physical store. They have gone from data to action. You see a range of practices. This takes you back to innovation. You will see it in the results. What different companies do is quite wide. It will take you to that end of Asian competitive -- innovated competitive.

If you think about everything from insurance -- for an example. We may sit in the same demographic, but maybe you drive in the middle of the night, and I drive during the day. That is update we did not know before. Now we can monitor that. We will find exactly how you drive when you drive. There are a lot of offerings and systems that are not efficient. There is more segmentation. When you look across multiple sectors, the potential to make certain sectors more productive is quite significant. We did some in-depth analysis. When you imagine what companies are doing using big data, it is very significant. This is another reason why we think this is important. What they do point out, why this is interesting -- it is a huge volume of it. All of this data is digital. Whenever you have digital data, the key is that what you can do to copy it, play with it, experiment is almost unbounded. I was surprised -- if you look at data 20 years ago, that resided in research institutes. They had all of the data. Today it is different. Most of it is in your pocket right now. You are carrying it around and capturing data. Who has the data, what do they do with it. That is a big shift. It is now pervasive across every sector. What you will find, any company of appreciable size -- one that employs 1000 or more people -- has as much data as the Library of Congress. The question becomes, how do companies leverage this? The part I like the most, the good news, much of this is going to be surplus. Which will be HREF it thing. --

If we stayed at a hotel two or three times they should know who we are. We are setting up our own expectation.
We rely on our cell phones to guide us around but the big benefit I like, what it will create for that consumer. The range of practices across countries is surprising.

The consumer benefit is price haggling. Your negotiations in the economic world are asymmetric more than ever.

Not only do you know what price to pay, you will know which store actually has it that day. You will also know how far away you are that day. It gets more interesting.

One thing, on energy. If you think about it, a lot of this is happening without any of us intervening. The number of things that have a sensor, an IP address is almost limitless. That means, like in energy, if you look at how we use energy. We leave things on. Lights, refrigerators, motors on. Now we can put these things everywhere, that benefit of energy [Indiscernible]. The potential is significant.

On the privacy side. We have seen so much on my ability to [Indiscernible] data. I did a story a few years ago on researchers that mashed up information. For 12 percent of the population they could get down to the nine digits -- their Social Security number. You understand this -- what is the thing to say to people about their privacy?

Mathematically you cannot guarantee that people will not [Indiscernible] that data you disclose. You can guarantee the risk. If people are willing to accept that risk. What is the risk -- as I said before, much of that data has been released to, not researchers, not health care institutions -- people are using the Internet at all times. They are donating data. The very thing to say, there is a benefit. It is not to one individual. The risk does exist, it can be made minimal.
I used to say, who cares about my health care data? In legislation and policy for insurance -- they cannot use that for a particular purpose and so on.

I have a question for all members of the panel. What concerns you about this? There are a number of dimensions. In Stanford we are saying about [Video and audio cutting in and out] there is always a pattern. Is the pattern significant? You have increased risk of false positives. There is a high level pattern recognition. Models are brittle. This gets into privacy -- discrimination based on that -- not really who you are. As practitioners in this world, what worries you?

The concerns you raised are valid. When people misapply the data. You can have confidence in the pattern. It speaks to the need for fundamental science in pattern recognition. That plays a critical role in our ability to analyze the kinds of data we are seeing. And to avoid mismanaging.

I would agree. Quite frankly, from the examples I have seen I am more trusting of the data models than the intuition-based decision-making you often see. There are a couple of things. One is the question of discrimination in an economic sense. One thing -- when you disaggregate to such a level, to offer services to certain demographics. The good news, quite often there are other providers that are happy with the right economic incentives. The other one I worry about, I am struck in the private sector how -- in some instances data stops what is it interesting -- what is going on. Retailers can analyze so many things. When you have correlations better [Indiscernible]. You will actually get -- sale of diapers are a big way next to the cans of formula. These are children diapers. [Laughter]. The point, you get issues that are practical for retail -- but you do not know what is really going on.

In many times it does happen.
That is why many things are called biomarkers. It is important to have good analysts and good people to interpret the data. That is why training individuals for that particular domain is an interesting proposition. Another thing I want to remind people, when the inventor of the airplane saw the airplane used in war, he was quite disappointed. It does not mean the airplane should not have been invented, just for another purpose. In our community, we are trying to invent something new that is very useful. We count on the larger community to regulate its use --.

It is a classic one in every field. It is a fact finding mission. It is easier to find it. It is more for the social science and politics.

The bigger that data sets are, we will still see inherent uncertainties. There is the uncertainty principle. We have to teach the next generation, and also statistical thinking. We have to be careful not just to give them a telescope for data but also [Indiscernible].

Before I forget, one early comment, you talk about training of people. Scientists train I-shaped people. And statisticians train T-shaped people, I gather. Your career is a good example.

I got there by accident. I was in computer science.

Deep in one area and broad in others [Indiscernible].

We have to start early. I think some of these grants address those.
I am Jim from RPI. You have been talking about math, statistics and computer. If we want our students to be analysts, we have to teach about peer, a lot of step on the social side. How do you see that fitting into the education training?

We have not touched on the other part -- that computer science. How we can transform professionals that are dealing with one data at a time, to understand the whole complexity of the data. There are several initiatives for training these individuals. I am glad to be part of them. You do bring back, in our case, doctors back to the computer science. So they can be the data analysts. Then they will marry that [Indiscernible] knowledge with that computer knowledge.

It is that data savvy management. That applies to marketing, behavioral science -- there is a big second swab. You have to be able to harness.

We need to train pi-shaped lawyers. [Laughter]

I want to address one point. Observation on people adopting these techniques. Not being able to explain beer and diapers. If you dig into that, beer and diapers are people who want to stay home and watch football. We work closely with the electronic records. One thing we face even with companies -- how do you explain in common human terms, with the language that doctors speak or the business speaks -- how to back [Indiscernible].

You are right. One thing we find by machine learning -- the machine learning is the 25 percent or so of the work in many cases. 75 percent is looking at the work. It is the human intuition. It is not to replace the machine, but help dig in the patterns that are better. That is important whenever the goal is discovery.

It is one thing for these companies to hire 150,000 new workers. How certain are you they are going to put their money where their mouth is, instead of wishful thinking?
It would be nice to have these workers --

The quick answer, right now there are lots of companies that have open reqs -- in retail, financial services and Internet services. If you talk to Hal Varian from Google, he will tell you it is difficult [Indiscernible] in hiring statisticians. We have a weird situation in the US economy, there is a shortage of jobs. There are companies that have open [Indiscernible]. You do survey after survey, most companies will say -- we have 40 percent open reqs I cannot fill because of particular skills.

Let me add to that. In Silicon Valley it is almost impossible to hire qualified engineers, developers -- it is a problem with this industry. You cannot find enough people to do the work required to make the economy continue to grow. I do not think it is the lack of jobs. It is lack of people with the right skills.

I think the answer, if this is true, being a competitive advantage -- you have to hire them. Or be killed.

I would like to make a comment regarding education. I am glad to see, in addition, in the data announcement that we have several initiatives to support the education for the next generation. What I would like to point out, when we educate data savvy or copy occasional say
every -- it has to be bilingual. For an example, we at NIH are trying to train the younger generation -- they need to get their hands wet. If we are only training data savvy people -- that is okay. When you train that younger generation they have to have both skills or they will fail.

One thing we saw in our study, a lot of the data is unstructured data. Comments people will write. They are unique, specific to the discipline. Or a specific language. Who is contributing to the unstructured data -- and so forth. Most will tell you their biggest challenge is capturing data on their website. It is all unstructured.

In addition to that, when you make predictive models, a lot of data savvy people -- you collected the data too late. You need to collect it the way we think is correct. Another question, that health donation of your personal data as a patient, there is a big movement called PatientsLikeMe. It has similar outcomes compared [Indiscernible]. My question to you, we do not have good tools to integrate that private donated data with our [Indiscernible].

Those efforts are ongoing. There are initiatives to do that. It is to integrate it with the electronic health. I would like to comment -- we are training -- it is interdisciplinary -- we are not all the way [Indiscernible]. They are starting their careers. For an example, we are training these individuals as well. We have to attack from all sides. If someone has a health care degree they can be complemented with a computer skill course. We need more of those.

My name is Karen Keane. I want to go back to a comment, we do not have data in education. We do have a lot of information in education that is not turned into data. The marginal cost of electronic data, in long-standing industries -- for the sake of argument. The data, the information collected is often not collected in a digital format.
Often, the information that is collected can drive decisions. I was interested in, from a medical perspective -- this is a problem that the medical -- collecting a broader range of data. Thinking how does an existing group that needs to be able to process that data -- I can imagine a world where we have common core state standards. You could create massive data sets if you could only collect the data across these groups. You would also have to change the way teachers and administrators and a variety of other people work. Thinking about -- how can we learn from that experience to maximize the potential for big data in places where it is not being taken advantage of.

As someone who has worked with the electronic healthcare records: it is less rosy than you think. There are records out there in electronic form; when you actually look in there, many fields are missing. Many fields are wrong. I think the industry has a way to go before one can make significant use of the information that is in principle -- but it is difficult to get at. In this respect, the educational industry, so to speak, because it is coming in later, can avoid some of the mistakes of not putting things in appropriate form. Collect things more systematically so one can make that [Indiscernible] for analysis. We know that the significant patterns that could [Indiscernible]. I think it would be a tremendous boost on how to teach our children.

We have made tremendous progress. Making it rosier. There is room for improvement in the way we are collecting data, and the way we use existing data and transform it into computable format. That is part of the education we need to inflict not only on the users but also on the professionals who are dealing with healthcare informatics. The need for change it this, I think there is tremendous progress. Going through steps. As much as we would like to leapfrog many steps -- we need stages -- for the record to be useful. That information will be sparse.
You do not need all the information. You do not need to ask all of the questions. You need to work with sparse data. And to count on the computer science community to develop ways to make the data work for us.

It is to acknowledge the question -- the basis of the question. One thing we found out that was interesting, when we look across all industries -- there are three categories that emerged. There were sectors where everything was ready. The investment was there, they were open and competitive. There was a lot of interaction. All of the things will happen. Then you have another category in the middle. Healthcare was one. There were challenges that you point out; it was set up to be an instrumented industry. Equipment that captures data. Payments going back and forth. The instrumentability is very high. There was incentive, and privacy. Education was in the third. It was a challenge. There have been relatively few investments in the system. It is not highly instrumented -- yet. If you can have standardized tests -- it limits the extent where it is an instrumented sector. It had the most challenge -- education was the most challenged sector. At the same time it is a sector where, as an economy, we spend enormous dollars. We spend a lot. It tends to be more efficient. It is set up not to take advantage. There is a national challenge, it is one sector -- how do we solve that one? It will not solve itself. I do not worry about us -- retail, but I worry about education.

I am here writing a story for AOL. Several weeks ago there was the big government forum. I think there is a disconnect between what you are doing and what intelligence is doing -- CIA. We have 20 systems doing this on a massive scale. A lot of new vendor products -- that seemed to be way beyond what you are talking about here. I think it needs to be informed and connected with the intelligence community --

We have gone through generations of leadership. And a lot of money.

I do agree with that. The most amazing example of this -- but the defense and intelligence. The leading edge companies.
To look for the most sophisticated models, you do need to look at agencies -- you also look at high-end retail. That is the [Indiscernible]

[Indiscernible-low volume]

I think we have exhausted our time. Thank you very much.

[Event concluded]