Big Data examples


This is a talk that I gave at a big data conference in St. Louis in August 2012.

Published in: Technology, News & Politics

  • Let’s look at planes, trains and automobiles first.
  • The immediacy and accessibility of Twitter provide a real-time glimpse into consumers' frustration, as you can see in this collection of just three tweets. Jeffrey Breen of Cambridge Aviation Research put this together to show sentiment analysis.
  • Here is his flowchart of how he put this all together, using R and various other data collection tools.
  • To tackle what is essentially a Big Data dilemma, FedEx is collaborating with General Electric (which is providing the company with commercial charging stations), utility Con Edison, and Columbia University researchers, who are developing artificial intelligence programs to manage when and where the electric trucks charge in a 10-vehicle pilot project. “We’re collecting data on what is the load on the facility, what is the load of each truck, how many miles does that truck drive,” says Sondhi. “The algorithms from Columbia will identify that a truck is going to drive 16 miles tomorrow, so don’t give it 30 amps, give it 8 amps so we minimize the load on the entire facility.”
  • Ford collects and aggregates data from the 4 million vehicles that use in-car sensing and remote app management software to create a virtuous cycle of information. The data allows Ford engineers to glean information on a range of issues, from how drivers are using their vehicles, to the driving environment, to electromagnetic forces affecting the vehicle, along with feedback on other road conditions that could help them improve the quality, safety, fuel economy and emissions of the vehicle. Here you see a typical Sync dash of a Ford sedan. Drivers willing to share how many miles they’ve traveled could get discounts between 10 and 40 percent in exchange for providing State Farm with a more accurate picture of their vehicle-use habits, which the insurer obtains electronically by directly accessing the Sync telematics systems in the cars. Your car has become a data hub, with USB ports, an SD card reader, Bluetooth connections to your phone and even a mobile Wi-Fi hotspot.
  • You can watch the positions of the various trains in Helsinki as they move about the map here.
  • Speaking of maps, there are thousands of big data mapping apps. Google Maps is certainly popular, but another site called Crowdmap makes mapping even easier. Here is a map of sexual violence against Syrian women that was found using that service.
  • Smith put this together from about 400 wineries in the Napa Valley area. Not only can you scroll and zoom the map, but clicking on one of the winery markers will tell you its address and whether an appointment is required for tastings. He worked with Barry Rowlingson, who used OpenStreetMap and his own R package to build this map.
  • Estimates of IT work effort are critical for deciding where in technology a business should invest. Lacking experience with similar projects, the business is often at a loss for hard data. In this article, we describe our benefit from the power and convenience of R in the elicitation task, or, in other words, in quantifying the uncertainty around IT project lifespans using probability distributions. We show how R's built-in functionality makes the elicitation task painless, while demonstrating how the methodology can be implemented in a user-friendly format. The power of R's probability toolbox allowed us to rapidly prototype an application which transported the basic concepts of elicitation to the IT project management space.
  • seems a suitable means for solving the task of providing accurate, understandable and automatable models for the desired temperature predictions. The R-project has proved to be most useful for the implementation of the calculated results, the same as the external control of its functionalities in a process automation environment. The presented mathematical approach and the developed R-code and framework program enable steel plant production engineers and technical staff to plan, carry out and adjust their tasks and doings on the basis of highly stable and precise temperature preset-values. Instead of adding off-sets and thresholds to the assumed heat target temperatures and by that adding extra processing time and extra energy during each processing step, to be on the safe side and rather deliver the melt above the final casting temperature than below, the new temperature prediction model will allow for the optimization of process stability, throughput and material quality in the steel plant, especially in ladle treatment.
  • We are looking at a hospital autoclave, which is used for sterilizing instruments. It is just one type of industrial equipment that Axeda is working with other companies to rig with sensors and cellular connections. Each of these devices has an IP address and an Internet connection, so its use can be monitored remotely and its supply, maintenance and management can all be optimized without anyone having to go and look at the machine itself. “Typically engineers would find logs through customer tickets and it would take months to find trends based on call center traffic.” You can collect data about uptime, need for repairs, machine run completion and detergent levels into a smartphone app that hospital employees can use.
  • Startup Compass collects data from tens of thousands of startups around the world, then creates best practices, recommendations and benchmarks to help entrepreneurs make better product and business decisions. First, startups can learn which key performance indicators (KPIs) actually matter; most startups don’t even know which KPIs they should track or why. Second, they learn how their KPIs compare to other companies’ KPIs, so they will know if they’re on the right track. See, for example, their customer acquisition costs. Third, they learn what actions they need to be taking. “We help businesses take the next steps.”
  • This is Procter & Gamble’s Business Sphere big data situation room in its Cincinnati HQ. A big data analyst drives these large screens, which display data visualizations on sales, market share, ad spending and the like, so everyone in the meeting is seeing the same information, based on 4 billion daily transactions of P&G products. P&G isn’t after new data types; it still wants to share and analyze point-of-sale, inventory, ad spending, and shipment data. What’s new is the higher frequency and speed at which P&G gets that data, and the finer granularity. Even with all this gear, P&G has only about two-thirds of the real-time data it needs.
  • They are trying to get at the reason behind the numbers, the “why”: was it a bad TV ad, out-of-stock shelves, or a competitor’s new product or price cut that caused a problem? Right now, the P&G IT team is working on automating analysis of the why, so employees get alerts when key events like a supply chain snafu or rival product launch happen. Their data visualizations can answer questions such as: Is a sales dip in detergent in France because of one retailer, so that’s where to focus? Is that retailer buying less only in France, or across Europe?
  • Saenz of Data Driven CEO
  • Jeff Jonas is a data scientist who now works for IBM. One of his jobs was designing the casino security systems in Las Vegas, where he currently lives. He worked for the surveillance intelligence group of several casinos and automated various manual processes, adding facial recognition software that was key to slowing down the MIT card counting group. "We built [another] system to immediately identify risk in real time so they could get these people out of the casino quickly." This software is still offered by IBM as its InfoSphere Identity Insight event processing and identity tracking technology.
  • "If someone has three phone numbers - no big deal. On the other hand, if someone has five different dates of birth, that just doesn't seem quite right, does it? That would be confusing. Why is this important? Well, if you are looking to analytics to make important decisions, wouldn't you want to know during the decision making process if there was related confusion ... before [any] action is taken."
  • Mason's analysis found that shortened links posted to Twitter have a mean half-life of 2.8 hours. Facebook boosts that to 3.2 hours, and direct sharing has a half-life of 3.4 hours. YouTube, however, beats them all hands down with a half-life of 7.4 hours. In other words, you might get a slight edge by posting to Facebook versus Twitter (if you don't do both), but the content matters most. Good (or controversial) stuff rises to the top and has a longer life. Uninteresting stuff sinks quickly.
  • You need to start thinking about how to make your data sets smaller. "Big Data usually refers to a data set that is too big to fit into your available memory, or too big to store on your own hard drive, or too big to fit into an Excel spreadsheet," says Mason. This is the "scrub" section. The smaller the dataset, the easier it is to manipulate.
  • Mason and others have mentioned the now iconic Enron email archive, which has since passed into the public domain. A number of big data researchers use it to test their email algorithms, and it is available from a number of online academic websites.
  • Andersen gave this talk at Strata earlier this year and showed how to combine basic public data from the city, street and mapping data from OpenStreetMap, real estate and rental listings data, data from social services like Foursquare, Yelp and Instagram, and photographs of streets from mapping services to create a holistic view of a very famous street in San Francisco, Haight Street. Surprisingly, you'll find a lot of Swedish folks on the upper half of Haight Street. Not surprisingly for San Francisco, many people on Haight speak Spanish or Japanese. Tweet stream analysis found more negative sentiment on the lower part of the street, which corresponds with higher crime stats.
  • If you are looking for large content repositories, you probably can't get much larger than the article archive of the Associated Press. The AP has launched a content analysis tool that searches the millions of articles in its archives to create custom archive products for its customers. Users can query for particular keywords, and the AP can use the search query traffic to see trending topics and deliver article collections to particular B2B customers. For example, it could create references on a particular subject or moment in time. The project makes use of a solution from MarkLogic, a major Big Data enabler that many different kinds of publishers, such as Lexis/Nexis, use for this type of purpose. (Source: "AP Creates New Big Data Approach to its Article Archive," David Strom, March 19th, 2012.) We have written about prior efforts by the AP to help modernize its archives, such as this project to provide non-profits with free information feeds. The AP didn't start out by using the MarkLogic solution; it first tried to implement a more traditional relational database structure, only to run into problems. Its archives are in XML, which made it difficult to design the right kind of data structures. Plus, it didn't have a consistent metadata collection across the archives. The MarkLogic implementation took 16 weeks from start to finish and was the first time that the AP had made use of MarkLogic's services. It enables the AP to run complex Boolean searches across millions of articles in its content archive and get back precise returns in seconds or minutes instead of days or weeks. This much quicker response time is already transforming its B2B product offerings and helping it to manage searching for unstructured content in near real-time.
  • The company was started by Julian Todd and Aidan McGuire, two U.K.-based analysts who have long been involved in opening up government data to the public.
  • This shows data that was mined from the UN peacekeeping troop levels, as one example of what you can do with the ScraperWiki site. They have lots of public data sets that are available for anyone to analyze, and they try to help journalists publish the information.
  • Appistry is used in FedEx's logistics apps, in Sprint's fraud detection services, and at defense contractor Northrop Grumman. San Francisco-based Presidio Health used a variety of products to boost its cloud performance. "Presidio had to handle a 16 times increase in data volume in a year and replace some aging hardware," says its CTO Thomas Gregory. It was able to increase its computing power by 70% without increasing the costs of its IT equipment. "We didn't want a lot of capital expense, and we wanted an environment that was safe and could spread our risk around." The company uses a combination of Eclipse and Spring-based open source software and Appistry for handling its cloud services management. "Appistry has integration with Spring, it was easy to use and saved us months of effort to move our software into this environment," he said. "Plus we don't have to expose any of our services externally."
  • On to sex. The dating site OkCupid looked through more than 4 million matches that it has made to find patterns in gay and straight sexual preferences. The median number of sexual partners for both men and women is six, exploding the myth that gays are more promiscuous.
  • Here are straight people who either have had or would like to have a same-sex experience in the continental U.S. and lower Canada. You can see some sharp geographic divides. Awesomely, the mountain West lives up to its Brokeback reputation, and Canada is orange nearly coast-to-coast. Even in the yellow and blue areas, you can see pockets of gay curiosity in interesting places: Austin, Madison, Asheville. Anywhere soy milk is served, basically. This is based on millions of responses. On average, active users have answered about 3000 questions; they've hidden the profiles of several thousand users they aren't interested in; they've voted for about 4000 profiles.
  • When OkCupid asked its members factual questions, this is how they sorted out by sexual preference and gender. We always knew that women were smarter.
  • Kaggle routinely hosts various big data contests, and this one, which concluded last month, was a way for Facebook to evaluate prospective employees. More than 400 people submitted entries.
  • Think big data is a lot of bull? Well, not according to the USDA. Of the 8 million Holstein dairy cows in the United States, there is exactly one bull that has been scientifically calculated to be the very best in the land. He goes by the name of Badger-Bluff Fanny Freddie, and he has 346 daughters who are on the books already. Their equations predicted from his DNA that he would be the best bull. A USDA research geneticist reviewed pedigree records and looked at things such as milk production and fat and protein content to optimize the breed. To give you an idea of how this industry has changed, in 1942 the average dairy cow produced less than 5,000 pounds of milk in its lifetime. Now, the average cow produces over 21,000 pounds of milk.
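
To make the link half-life figures quoted in the notes above more concrete, here is a minimal Python sketch. It assumes that attention to a link decays exponentially, which the talk does not state; the function names are hypothetical, and only the four half-life values come from the notes.

```python
# Half-lives in hours, as quoted in the notes above.
HALF_LIFE_HOURS = {
    "Twitter": 2.8,
    "Facebook": 3.2,
    "direct sharing": 3.4,
    "YouTube": 7.4,
}


def fraction_remaining(hours_elapsed, half_life):
    """Share of a link's eventual clicks still to come.

    Assumes simple exponential decay (an assumption, not from the talk):
    after one half-life, half of the remaining clicks are left.
    """
    return 0.5 ** (hours_elapsed / half_life)


def report(hours_elapsed):
    """Map each source to the fraction of clicks still to come."""
    return {
        source: round(fraction_remaining(hours_elapsed, half_life), 3)
        for source, half_life in HALF_LIFE_HOURS.items()
    }


if __name__ == "__main__":
    for source, share in report(6.0).items():
        print(f"{source}: {share:.1%} of clicks still to come after 6 hours")
```

Under this assumption, the gap between Twitter, Facebook and direct sharing stays small at any point in time, while YouTube links keep a visibly larger share of their clicks for longer, which matches the point that the platform gives only a slight edge.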

    1. How Big Data Can Help Your Business: Case Studies from ReadWriteWeb. David Strom, StampedeCon, August 1, 2012. Download this here:
    2. My publications. Editorial management positions:
    3. Some oddball stuff • Planes, trains and automobiles • Fun with maps • Big and little ovens • Lessons learned from P&G • Noteworthy scientists • And of course sex!
    4.
    5. The reason behind
    6. Three skills for big data CEOs • Strategic data planning. Data is the new raw material for any business. • Analytical skills. CEOs should be incredibly smart about asking the right questions. • Technology skills. Embrace the technology and make it a key part of your CEO skill set.
    7. More from Jeff Jonas vs.
    8. Mason’s 5-step Big Data process • Obtain • Scrub • Explore • Model • Interpret
    9. Questions? David Strom 314 277 7832 @dstrom (Twitter)
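
Slide 8 names Mason's five steps without showing what each might look like in practice. The Python sketch below is one possible reading, not Mason's own method: the sample records, field names and all five functions are hypothetical and exist only to illustrate the order of the steps.

```python
from statistics import mean

# Hypothetical raw records; none of this data comes from the talk.
RAW_ROWS = [
    {"link": "a", "source": "Twitter", "hours_active": "2.5"},
    {"link": "b", "source": "twitter ", "hours_active": "3.1"},
    {"link": "c", "source": "YouTube", "hours_active": "7.0"},
    {"link": "d", "source": "YouTube", "hours_active": ""},     # missing value
    {"link": "a", "source": "Twitter", "hours_active": "2.5"},  # duplicate
]


def obtain():
    """Obtain: fetch the raw data (here, an in-memory stand-in)."""
    return list(RAW_ROWS)


def scrub(rows):
    """Scrub: drop incomplete and duplicate rows, normalize fields."""
    seen = set()
    clean = []
    for row in rows:
        if not row["hours_active"] or row["link"] in seen:
            continue
        seen.add(row["link"])
        clean.append({
            "link": row["link"],
            "source": row["source"].strip().lower(),
            "hours_active": float(row["hours_active"]),
        })
    return clean


def explore(rows):
    """Explore: group the values by source to see what is there."""
    groups = {}
    for row in rows:
        groups.setdefault(row["source"], []).append(row["hours_active"])
    return groups


def model(groups):
    """Model: a deliberately simple model, the mean per source."""
    return {source: mean(values) for source, values in groups.items()}


def interpret(averages):
    """Interpret: turn the numbers into a statement someone can act on."""
    best = max(averages, key=averages.get)
    return f"{best} links stay active longest in this sample"


if __name__ == "__main__":
    print(interpret(model(explore(scrub(obtain())))))
```

The scrub step also shrinks the data set, which ties in with the advice in the notes that a smaller dataset is easier to manipulate.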