Umsl big data


This is a talk I am giving to a UM St Louis MBA class on big data and hadoop

Published in: Technology
  • V3: Use older Stampede deck for URL sources. Add Target pregnant teen and Twitter map and Kibera at end.
  • So let's talk this morning about how Big Data does come from all corners of the globe and while it may not be evil, there are some fascinating examples of where it is being used by companies today and I'll review some of these case studies pulled from some of the articles that I and my colleagues in the IT trade press have been writing about over the past several months.
  • Let's start with planes, trains and automobiles. I am sure many of you remembered this movie with Steve Martin and John Candy and their various misadventures. Well, when it comes to Big Data the applications are a bit more positive.
  • As you know, the US Department of Transportation collects monthly on-time statistics for each of the major airlines. But a better method comes from Jeffrey Breen of Cambridge Aviation Research. He put this together to show sentiment analysis using the immediacy and accessibility of Twitter. He provides a real-time glimpse into consumers' frustration, as you can see in this collection of tweets.
  • Here is his flowchart of how he put this all together, using R and various other data collection tools to score the tweets, summarize the results for each airline, and compare them with what the federal government provides.
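  • Breen's actual pipeline was written in R, but the core scoring step is simple enough to sketch. Here is a minimal Python version of word-list sentiment scoring; the word lists and sample tweets are invented for illustration and are not his data.

```python
# Word-list sentiment scoring in the spirit of Breen's airline analysis:
# score = (# positive words) - (# negative words) per tweet, then average
# the scores per airline. Word lists and tweets below are invented.
import re

POSITIVE = {"great", "thanks", "love", "helpful", "comfy"}
NEGATIVE = {"delayed", "lost", "worst", "cancelled", "rude"}

def score(tweet):
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def summarize(tweets_by_airline):
    # Mean sentiment per airline, happiest carrier first.
    avg = {a: sum(map(score, ts)) / len(ts) for a, ts in tweets_by_airline.items()}
    return sorted(avg.items(), key=lambda kv: -kv[1])

tweets = {
    "jetblue": ["great crew, thanks!", "love the comfy seats"],
    "united": ["flight delayed again", "worst airline, they lost my bag"],
}
print(summarize(tweets))
```

His real pipeline then compares these per-airline summaries against the federal on-time statistics; this sketch stops at the ranking.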
  • As you can see in this output, airlines such as JetBlue and Southwest do a better job of customer service than the older, more established carriers such as Delta or United. So maybe it is a good thing that Southwest is now our major carrier here at Lambert, even though many of us miss all those nonstop TWA flights to all those cities.
  • Moving from planes to trucks is this story about how FedEx is collaborating with General Electric, which is providing the company with commercial charging stations for its electric vehicles. While FedEx can tell you where a particular package is located in its network, it has other Big Data dilemmas, including whether it makes sense to use electric power for its delivery trucks. It got together with GE, the utility Con Edison, and Columbia University researchers. The group is developing artificial intelligence programs to manage when and where the electric trucks charge in a 10-vehicle pilot project. “We’re collecting data on what is the load on the facility, what is the load of each truck, how many miles does that truck drive,” says Sondhi. “The algorithms from Columbia will identify that a truck is going to drive 16 miles tomorrow, so don’t give it 30 amps, give it 8 amps so we minimize the load on the entire facility.”
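  • The Columbia algorithms themselves are not public, but the charging idea in that quote can be sketched: give each truck current roughly in proportion to tomorrow's predicted mileage, within a facility-wide amp budget. The function, truck names and numbers here are all invented.

```python
# Illustrative sketch of the charging idea quoted above: allocate charging
# current in proportion to each truck's predicted mileage, capped per truck
# and bounded by a facility-wide amp budget. Not FedEx/Columbia code.
def allocate_amps(predicted_miles, facility_limit_amps, max_per_truck=30.0):
    total = sum(predicted_miles.values())
    alloc = {}
    for truck, miles in predicted_miles.items():
        share = facility_limit_amps * miles / total if total else 0.0
        alloc[truck] = round(min(share, max_per_truck), 1)
    return alloc

fleet = {"truck_a": 16, "truck_b": 48, "truck_c": 96}  # predicted miles tomorrow
print(allocate_amps(fleet, facility_limit_amps=60))
```

The truck with the short 16-mile route gets a small trickle of current while the long-haul truck is capped, which is exactly the load-smoothing effect Sondhi describes.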
  • Your car has become a data hub, with USB ports, an SD card reader, Bluetooth connections to your phone and even a mobile Wi-Fi hotspot. This next picture is a shot of the latest Ford My Touch dashboard that can be found in many of their cars. It provides all sorts of controls over what music you listen to, the indoor climate of your car, and a connection to your phone to dial your address book. Currently, Ford collects and aggregates data from the 4 million vehicles that use in-car sensing and remote app management software to create a virtuous cycle of information. The data allows Ford engineers to glean information on a range of issues, from how drivers are using their vehicles, to the driving environment, to electromagnetic forces affecting the vehicle, and feedback on other road conditions that could help them improve the quality, safety, fuel economy and emissions of the vehicle. Drivers willing to share how many miles they’ve traveled could get discounts of between 10 and 40 percent in exchange for providing State Farm with a more accurate picture of their vehicle-use habits, which it obtains by directly accessing the Sync telematics systems in the cars.
  • And finally we have trains, specifically the trains operated in Helsinki for its transit network. The organization uses Big Data tools to provide real time information on where their trains are located, and you can watch this web page to see where they are and when they will arrive at your location. A number of other transit agencies are doing something similar.
  • Twitter analyzed all their tweets and organized which ones got retweeted the most by state, split between Obama and Romney.
  • Speaking of maps, there are thousands of big data mapping apps. Google Maps is certainly popular, but there are other sites making it even easier, such as Crowdmap and OpenStreetMap. Here is a crowdsourced map of a neighborhood outside of Nairobi, Kenya, which until this effort was pretty much uncharted territory. Thanks to this citizen effort, the community put together a map with all sorts of resources located on it.
  • Let me put up the next slide showing you something a bit more palatable. David Smith put this map together from about 400 wineries in the Napa Valley area. Not only can you scroll and zoom the map, but clicking on one of the winery markers will tell you its address and whether an appointment is required for tastings. He worked with Barry Rowlingson, who used OpenStreetMap and his own R package to build this. And while 400 data points doesn't sound like a very big collection of data, what these guys did is noteworthy since they used a collection of APIs and open source code to produce the final product.
  • Big Data is also being used inside IT organizations themselves, as this next example from Nationwide Insurance's IT department shows. My first job out of college was working for the IT department of a major insurance company in New York City, back in the punch card era, so this example is particularly poignant to me. Accurate estimates of IT work effort are critical for deciding where in technology a business should invest, but lacking experience with similar projects, the business is often at a loss for hard data. In their article, the IT researchers from the insurance company describe how they benefited from the power and convenience of R in the elicitation task, or, in other words, in quantifying the uncertainty around IT project lifespans using probability distributions. They show how R's built-in functionality makes the elicitation task painless, while demonstrating how the methodology can be implemented in a user-friendly format. The power of R's probability toolbox allowed them to rapidly prototype an application that brought the basic concepts of elicitation to the IT project management space.
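  • The Nationwide team's R toolkit isn't reproduced here, but the elicitation idea can be sketched with nothing beyond Python's standard library: ask an expert for a median and a 90th-percentile duration estimate, fit a lognormal distribution to those two points, and read off the risk of overrun. The numbers below are illustrative, not from the article.

```python
# Percentile-based elicitation sketch (the article used R's probability
# toolbox; this stdlib version fits a lognormal to an expert's median and
# 90th-percentile estimates of project duration).
import math

Z90 = 1.2816  # standard-normal 90th percentile

def fit_lognormal(median, p90):
    # The median pins down mu; the spread from median to p90 pins down sigma.
    mu = math.log(median)
    sigma = (math.log(p90) - mu) / Z90
    return mu, sigma

def prob_exceeds(duration, mu, sigma):
    # P(project lasts longer than `duration`), via the normal CDF of the log.
    z = (math.log(duration) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

mu, sigma = fit_lognormal(median=12, p90=24)  # months, illustrative
print(round(prob_exceeds(18, mu, sigma), 3))  # chance the project runs past 18 months
```

That is the whole point of elicitation: the expert never has to name a distribution, just two plausible numbers, and the math turns them into a quantified risk.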
  • Big Data can be used in a variety of corporate settings, so let's look at two examples of how it can control ovens big and small. This is a steel foundry. Here R is being used as a means of building accurate, understandable and automatable models for the desired temperature predictions. The R project proved most useful both for implementing the calculated results and for controlling its functionality externally from a process automation environment. The mathematical approach and the R code and framework program they developed enable steel plant production engineers and technical staff to plan, carry out and adjust their work on the basis of highly stable and precise temperature preset values. Previously, engineers added offsets and thresholds to the assumed heat target temperatures to be on the safe side, preferring to deliver the melt above the final casting temperature rather than below it, which added extra processing time and extra energy during each processing step. The new temperature prediction model instead allows for the optimization of process stability, throughput and material quality in the steel plant, especially in ladle treatment.
  • But Big Data can be used in the smallest of ovens, too. Here we are looking at a hospital autoclave, which is used for sterilizing instruments. This is just one type of the industrial equipment that Axeda is working with other companies to rig with sensors and cellular connections. Each of these devices has an IP address and an Internet connection, so their use can be monitored remotely and their supply, maintenance and management can all be optimized, without anyone having to go and look at the machines themselves. “Typically engineers would find logs through customer tickets and it would take months to find trends based on call center traffic.” You can collect data about uptime, need for repairs, machine run completion and detergent levels into a smartphone app that hospital employees can use.
  • Big Data can be used for all sorts of businesses, including helping startups. The site Startup Compass collects data from tens of thousands of startups around the world. It then creates best practices, recommendations and benchmarks to help entrepreneurs make better product and business decisions. First, startups can learn which key performance indicators actually matter; most startups don’t even know which KPIs they should track or why they should track them. Second, they learn how their KPIs compare to other companies’ KPIs, so they will know if they’re on the right track; see, for example, their customer acquisition costs. The third thing they learn is what actions they need to take. As the company puts it, “We help businesses take the next steps.”
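  • At its heart, that kind of benchmarking is just placing a startup's metric within a peer distribution. A minimal sketch, with invented peer customer-acquisition costs (this is not Startup Compass code):

```python
# Benchmarking sketch: where does a startup's KPI sit among its peers?
# The peer customer-acquisition costs below are invented for illustration.
def percentile_rank(value, peers):
    below = sum(p < value for p in peers)
    return 100.0 * below / len(peers)

peer_cac = [40, 55, 60, 75, 90, 120, 200]  # peers' cost per acquired customer
print(percentile_rank(60, peer_cac))  # percentile rank of a $60 CAC among peers
```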
  • Big Data is also being used in some of the world's largest corporations. We are looking at Procter & Gamble’s Business Sphere big data situation room in their Cincinnati HQ. A big data analyst drives these large screens that display data visualizations on sales, market share, ad spending and the like, so everyone in the meeting is seeing the same information, based on 4 billion daily transactions of P&G products. P&G isn’t after new data types; it still wants to share and analyze point-of-sale, inventory, ad spending, and shipment data. What’s new is the higher frequency and speed at which P&G gets that data, and the finer granularity. Even with all this gear, P&G has only about two-thirds of the real-time data it needs.
  • They are trying to address the reason behind the numbers, the why. Was it a bad TV ad, out-of-stock shelves, or a competitor’s new product or price cut that caused a problem? Right now, the P&G IT team is working on automating analysis of the why, so employees get alerts when key events like a supply chain snafu or rival product launch happen. Their data visualizations can answer questions such as: Is a sales dip in detergent in France because of one retailer, so that’s where to focus? Is that retailer buying less only in France, or across Europe?
  • Saenz of Data Driven CEO
  • Let's move on to some of the Big Data rock stars that I have interviewed and really enjoy hearing from. Jeff Jonas is a data scientist who now works for IBM. One of his jobs was designing the casino security systems in Las Vegas, where he currently lives. He worked for the surveillance intelligence group of several casinos and automated various manual processes, adding facial recognition software that was key to slowing down the MIT card counting group. "We built [another] system to immediately identify risk in real time so they could get these people out of the casino quickly." This software is still offered by IBM as its InfoSphere Identity Insight event processing and identity tracking technology.
  • As Jonas puts it: “If someone has three phone numbers, no big deal. On the other hand, if someone has five different dates of birth, that just doesn't seem quite right, does it? That would be confusing. Why is this important? Well, if you are looking to analytics to make important decisions, wouldn't you want to know during the decision making process if there was related confusion ... before [any] action is taken?”
  • Another great Big Data scientist is Hilary Mason, who works for the link-shortening service bit.ly. She has found that shortened links posted to Twitter have a mean half-life of 2.8 hours. Facebook boosts that to 3.2 hours, and direct sharing has a half-life of 3.4 hours. YouTube, however, beats them all hands down with a half-life of 7.4 hours. In other words, you might get a slight edge by posting to Facebook versus Twitter (if you don't do both), but the content matters most. Good (or controversial) stuff rises to the top and has a longer life; uninteresting stuff sinks quickly.
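  • To make those half-life figures concrete: a link's half-life is the time by which half of its eventual clicks have arrived. A tiny sketch with invented click timestamps:

```python
# A link's half-life, in this sense, is the time by which half of its
# total clicks have arrived. The click timestamps (hours after posting)
# below are invented for illustration, not bit.ly data.
def half_life(click_times_hours):
    clicks = sorted(click_times_hours)
    return clicks[len(clicks) // 2]  # time of the median click

twitter_clicks = [0.1, 0.4, 0.9, 1.6, 2.8, 5.0, 9.0, 20.0, 33.0]
print(half_life(twitter_clicks))  # → 2.8
```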
  • You need to start thinking about how to make your data sets smaller. "Big Data usually refers to a data set that is too big to fit into your available memory, or too big to store on your own hard drive, or too big to fit into an Excel spreadsheet," says Mason. This is the "scrub" section of her process. The smaller the dataset, the easier it is to manipulate.
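  • In the spirit of that scrub step, one way to make data "smaller" is never to hold it all in memory: stream a big file line by line and keep only the aggregate you need. The CSV layout here is invented for illustration.

```python
# Streaming "scrub": walk a large log line by line, keeping only a small
# aggregate, so the full data set never has to fit in memory. The
# country,event CSV layout below is invented.
from collections import Counter
import io

def count_by_key(lines, field=0, sep=","):
    counts = Counter()
    for line in lines:
        counts[line.rstrip("\n").split(sep)[field]] += 1
    return counts

# Stand-in for open("huge.log"): any iterable of lines works the same way.
sample = io.StringIO("us,click\nuk,click\nus,view\n")
print(count_by_key(sample))
```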
  • Mason and others have mentioned the now iconic Enron email archive, which has since passed into the public domain and is available from a number of online academic websites; many big data researchers use it to test their email algorithms. This corpus of actual emails forms the basis of many anti-spam programs these days, which is ironic given that the emails have outlasted the company where everyone once worked.
  • Here we are looking at a scene along a very famous street in San Francisco, Haight Street. If you are old enough, you might remember the wild times in the late 1960s and early 70s, when the intersection of Haight and Ashbury was the center of counterculture and hippiedom. Today the area is still pretty much out there. Jesper Andersen gave this talk at Strata earlier this year and showed how to integrate basic public data from the city, street and mapping data from OpenStreetMap, real estate and rental listings data, data from social services like Foursquare, Yelp and Instagram, and analysis of photographs of streets from mapping services to create a holistic view of the street. Surprisingly, you'll find a lot of Swedish folks on the upper half of Haight Street. Not surprisingly for San Francisco, many people on Haight speak Spanish or Japanese. Tweet stream analysis found more negative sentiment on the lower part of the street, which corresponds with higher crime stats. I like this example because it shows what can be done to combine a variety of data sources to get more insight into where we all choose to live.
  • MaxPoint Interactive used its technology to find, down to the neighborhood level, which areas of the country are most interested in BBQ foods. The company analyzed billions of data points consumed by neighborhoods across the U.S., such as offline point-of-sale data, social media, videos, music, local Web pages, and online magazines and recipes related to barbeque foods. It found two very distinct neighborhood types when it comes to barbequing, those that prefer chicken and those that prefer pork, and Seattle and Portland, Oregon were the top two rated cities when it comes to BBQ.
  • Here we are looking at a facsimile of an old newspaper (you remember newspapers, right?). Ironically, it was called the New York Mirror. And while this and so many other newspapers have bitten the dust, one operation that is still in business is The Associated Press. If you are looking for large content repositories, you probably can't get much larger than the article archive of the Associated Press. Today they announced they have launched a content analysis tool that searches the millions of articles in their archives to create custom archive products for their customers. The project makes use of a solution from MarkLogic, a major Big Data enabler that is used by many different kinds of publishers for this type of purpose, such as Lexis/Nexis. The AP didn't start out with the MarkLogic solution; it first tried to implement a more traditional relational database structure, only to run into problems. Their archives are in XML, which made it difficult to design the right kind of data structures, and they didn't have a consistent metadata collection across the archives. The MarkLogic implementation took 16 weeks from start to finish and was the first time the AP had made use of the company's services. It enables them to run complex Boolean searches across millions of articles in their content archive and get back precise results in seconds or minutes instead of days or weeks. This much quicker response time is already transforming their B2B product offerings and helps them manage searching unstructured content in near real time. Users can query for particular keywords, and the AP can use the search query traffic to see trending topics and deliver article collections to particular B2B customers. For example, they could create references on a particular subject or moment in time.
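  • MarkLogic's engine is proprietary, but the reason indexed Boolean search returns in seconds rather than days is easy to illustrate with a toy inverted index: build the index once, and each query becomes a few set operations instead of a scan of every article. The articles below are invented.

```python
# Toy inverted index: maps each word to the set of documents containing
# it, so Boolean queries become set intersections and differences. A real
# engine like MarkLogic is far richer; the archive here is invented.
from collections import defaultdict

def build_index(articles):
    index = defaultdict(set)
    for doc_id, text in articles.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def search(index, must=(), must_not=()):
    # AND together the must-terms, then subtract the must-not terms.
    hits = set.intersection(*(index[w] for w in must)) if must else set()
    for w in must_not:
        hits -= index[w]
    return sorted(hits)

archive = {
    1: "election results in ohio",
    2: "ohio weather flooding",
    3: "election coverage nationwide",
}
idx = build_index(archive)
print(search(idx, must=["election"], must_not=["ohio"]))  # → [3]
```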
  • The AP isn't the only journalistic Big Data effort going on. Here we are looking at the site for a company called ScraperWiki, which was started by Julian Todd and Aidan McGuire, two U.K.-based analysts who have long been involved in opening up government data to the public.
  • This is showing data mined from UN peacekeeping troop levels, as one example of what you can do with the ScraperWiki site. They have lots of public data sets that are available for anyone to analyze, and they try to help journalists publish the information.
  • Moving closer to home we are looking at a St Louis based company called Appistry that is being used by a lot of different people for Big Data applications, including FedEx's logistics apps, Sprint's fraud detection services, and at defense contractor Northrop Grumman. San Francisco-based Presidio Health used a variety of products to boost its cloud performance. "Presidio had to handle a 16 times increase in data volume in a year and replace some aging hardware," says its CTO Thomas Gregory. It was able to increase its computing power by 70% without increasing the costs of its IT equipment. "We didn't want a lot of capital expense, and we wanted an environment that was safe and could spread our risk around." Presidio uses a combination of Eclipse and Spring-based open source software and Appistry for handling its cloud services management. "Appistry has integration with Spring, it was easy to use and saved us months of effort to move our software into this environment," he said. "Plus we don't have to expose any of our services externally.”
  • Now, on to sex. The dating site OkCupid looked through more than 4 million matches that it has made to find patterns in gay and straight sexual preferences. The median number of sexual partners for both men and women is six, exploding the myth that gays are more promiscuous.
  • Here are straight people who either have had or would like to have a same-sex experience, mapped across the continental U.S. and lower Canada. You can see some sharp geographic divides. Awesomely, the mountain West lives up to its Brokeback reputation, and Canada is orange nearly coast-to-coast. Even in the yellow and blue areas, you can see pockets of gay curiosity in interesting places: Austin, Madison, Asheville. Anywhere soy milk is served, basically. This is based on millions of responses. On average, active users have answered about 3,000 questions, hidden the profiles of several thousand users they aren't interested in, and voted on about 4,000 profiles.
  • Here is another example from the OkCupid data set. They asked their members which is bigger, the Earth or the Sun, and you can see how the results sort based on gender. Guys really are dumber, sad to say.
  • One of my favorite Big Data hotbeds is Kaggle, which routinely hosts various big data contests. This one, which concluded last month, was a way for Facebook to evaluate prospective employees. More than 400 people submitted entries.
  • Still think big data is a lot of bull? Well, not according to the USDA. Of the 8 million Holstein dairy cows in the United States, there is exactly one bull that has been scientifically calculated to be the very best in the land. He goes by the name of Badger-Bluff Fanny Freddie, and he already has 346 daughters on the books. Their equations predicted from his DNA that he would be the best bull: USDA research geneticists reviewed pedigree records and looked at things such as milk production and fat and protein content to optimize the breed. To give you an idea of how this industry has changed, in 1942 the average dairy cow produced less than 5,000 pounds of milk in its lifetime. Now, the average cow produces over 21,000 pounds of milk.
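  • The USDA's actual genomic evaluation is far more sophisticated, but the core idea of predicting a bull's value from DNA can be sketched as a weighted sum of marker (SNP) effects. Everything below, effects and genotypes alike, is invented for illustration.

```python
# Sketch of genomic prediction: score an animal as a weighted sum of
# marker (SNP) effects estimated from the population. The real USDA model
# is far more elaborate; these effects and genotypes are invented.
def breeding_score(genotype, marker_effects):
    # genotype maps each marker to copies of the favorable allele (0, 1, or 2)
    return sum(marker_effects.get(m, 0.0) * copies for m, copies in genotype.items())

effects = {"snp1": 0.8, "snp2": -0.3, "snp3": 0.5}
bulls = {
    "freddie": {"snp1": 2, "snp2": 0, "snp3": 2},
    "rival":   {"snp1": 1, "snp2": 2, "snp3": 1},
}
best = max(bulls, key=lambda b: breeding_score(bulls[b], effects))
print(best)  # → freddie
```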
  • These are automated milking machines from the Swedish company DeLaval, one of the vendors responsible for this incredible increase in milk production. You can see the small computer control station on the right, and there is even an Internet connection so that farmers can monitor the milk collection remotely and run their herd from a laptop.
  • Finally, I wanted to end today's presentation on a high note: this article, published earlier this summer in TechRepublic, mentions the shift in IT jobs from coding to analysis. You folks are on the leading edge of that trend, so you should be feeling good about yourselves.
  • Here are some of the local meetups if you want to learn more about Big Data.
  • Thanks everyone for listening to me and good luck with your own Big Data explorations.

    1. It is time to learn about Big Data and Hadoop! David Strom, UMSL, November 2012. Twitter: @dstrom. Download this here:
    2. My publications. Editorial management positions:
    3. So Big Data really is everywhere!
       • Planes, trains and automobiles
       • Fun with maps
       • Big and little ovens
       • Lessons learned from P&G
       • Noteworthy scientists
       • And of course sex!
    4.
    5. The reason behind
    6. Three skills for big data CEOs
       • Strategic data planning. Data is the new raw material for any business.
       • Analytical skills. CEOs should be incredibly smart about asking the right questions.
       • Technology skills. Embrace the technology and make it a key part of your CEO skill set.
    7. More from Jeff Jonas vs.
    8. Mason's 5-step Big Data process
       • Obtain
       • Scrub
       • Explore
       • Model
       • Interpret
    9. Local Big Data Meetups
    10. Thanks for your ideas!
       • Copies of this presentation:
       • My blog:
       • Follow me on Twitter: @dstrom
       • Old school: 43