There are numerous analytical techniques that can be used to examine Big Data sources. I describe several of the more popular ones in this talk for a Washington University roundtable discussion in July 2015.
http://www.readwriteweb.com/cloud/2012/02/strata-2012-3-essential-skills.php Diego Saenz of Data Driven CEO
So let's talk this morning about how Big Data comes from all corners of the globe. While it may not be evil, there are some fascinating examples of how companies are using it today. I'll review some case studies pulled from articles that my colleagues in the IT trade press and I have written over the past several months.
As you know, the US Department of Transportation collects monthly on-time statistics for each of the major airlines. But a better method comes from Jeffrey Breen of Cambridge Aviation Research. He put this together to show sentiment analysis that takes advantage of the immediacy and accessibility of Twitter. It provides a real-time glimpse into consumers' frustration. The flowchart shows how he uses R and various other data collection tools to score the tweets, summarize the scores for each airline, and compare them with what the federal government provides.
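To make the scoring step concrete, here is a minimal sketch in Python rather than R. It assumes a simple word-list approach: count the positive words in a tweet, subtract the negative ones, then average the scores per airline. The word lists, sample tweets, airline names and function names are all made up for illustration; they are not Breen's code or data.

```python
import re
from collections import defaultdict

# Tiny illustrative word lists (assumptions). A real analysis would load
# a full opinion lexicon with thousands of entries.
POSITIVE = {"great", "love", "thanks", "friendly", "smooth", "early"}
NEGATIVE = {"delayed", "lost", "rude", "cancelled", "worst", "late"}


def score_tweet(text):
    """Return (# positive words) - (# negative words) for one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def summarize_by_airline(tweets):
    """Average the tweet scores per airline.

    tweets is an iterable of (airline, text) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # airline -> [score sum, tweet count]
    for airline, text in tweets:
        totals[airline][0] += score_tweet(text)
        totals[airline][1] += 1
    return {airline: total / count for airline, (total, count) in totals.items()}


# Hypothetical sample data, not real tweets.
SAMPLE_TWEETS = [
    ("AirlineA", "Flight delayed again and the agent was rude"),
    ("AirlineA", "Thanks for the smooth flight"),
    ("AirlineB", "Love the friendly crew, landed early"),
]

if __name__ == "__main__":
    for airline, mean_score in summarize_by_airline(SAMPLE_TWEETS).items():
        print(airline, mean_score)
```

The per-airline averages are what you would then line up against the government's on-time numbers.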
Your car has become a data hub, with USB ports, an SD card reader, Bluetooth connections to your phone and even a mobile Wi-Fi hotspot. This next picture is a shot of the latest MyFord Touch dashboard that can be found in many Ford cars. It controls what music you listen to and the car's climate settings, and it connects to your phone so you can dial from your address book. Currently, Ford collects and aggregates data from the 4 million vehicles that use in-car sensing and remote app management software to create a virtuous cycle of information. The data lets Ford engineers glean information on a range of issues, from how drivers are using their vehicles, to the driving environment, to electromagnetic forces affecting the vehicle, along with feedback on road conditions, all of which could help them improve the quality, safety, fuel economy and emissions of the vehicle. Insurers can use the data too: drivers willing to share how many miles they've traveled could get discounts of between 10 and 40 percent in exchange for giving State Farm a more accurate picture of their vehicle-use habits, which the insurer obtains electronically by directly accessing the Sync telematics systems in the cars.
Using Tableau and OpenStreetMap data, you can spot trends in Austin's teacher turnover. While it is a city-wide problem, it is particularly acute in the poorer areas of the east side.
But Big Data can also be used in corporate situations that are fairly mundane. Here we are looking at a hospital autoclave, which is used for sterilizing instruments. This is just one type of industrial equipment that Axeda is working with other companies to rig with sensors and cellular connections. Each of these devices has an IP address and an Internet connection, so its use can be monitored remotely and its supply, maintenance and management can all be optimized without anyone having to go and look at the machine itself. "Typically engineers would find logs through customer tickets and it would take months to find trends based on call center traffic." Now you can collect data about uptime, need for repairs, machine run completion and detergent levels into a smartphone app that hospital employees can use.
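Here is a rough sketch of what that kind of remote monitoring could look like once the readings arrive. Everything in it is an assumption for illustration: the field names, the thresholds and the sample devices are hypothetical and do not describe Axeda's actual platform or data format.

```python
from dataclasses import dataclass

# Hypothetical alert thresholds (assumptions, not vendor values).
DETERGENT_MIN_PCT = 20.0
UPTIME_MIN_PCT = 95.0


@dataclass
class Reading:
    """One status report from a connected machine (illustrative fields)."""
    device_id: str
    uptime_pct: float
    detergent_pct: float
    runs_started: int
    runs_completed: int


def flag_devices(readings):
    """Return {device_id: [problems]} for machines that need attention."""
    alerts = {}
    for reading in readings:
        problems = []
        if reading.detergent_pct < DETERGENT_MIN_PCT:
            problems.append("low detergent")
        if reading.uptime_pct < UPTIME_MIN_PCT:
            problems.append("low uptime")
        if reading.runs_completed < reading.runs_started:
            problems.append("incomplete runs")
        if problems:
            alerts[reading.device_id] = problems
    return alerts


# Hypothetical sample readings.
SAMPLE_READINGS = [
    Reading("autoclave-1", uptime_pct=99.2, detergent_pct=64.0,
            runs_started=12, runs_completed=12),
    Reading("autoclave-2", uptime_pct=91.5, detergent_pct=12.0,
            runs_started=10, runs_completed=8),
]

if __name__ == "__main__":
    for device_id, problems in flag_devices(SAMPLE_READINGS).items():
        print(device_id, ", ".join(problems))
```

The point is that the trend surfaces as soon as the reading arrives, rather than months later through call center tickets.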
Big Data is also being used in some of the world's largest corporations. We are looking at Procter & Gamble's Business Sphere big data situation room in its Cincinnati HQ. A big data analyst drives these large screens, which display data visualizations on sales, market share, ad spending and the like, so everyone in the meeting sees the same information, based on 4 billion daily transactions of P&G products. P&G isn't after new data types; it still wants to share and analyze point-of-sale, inventory, ad spending, and shipment data. What's new is the higher frequency and speed at which P&G gets that data, and the finer granularity. Even with all this gear, P&G has only about two-thirds of the real-time data it needs.
Let's move on to some of the Big Data rock stars that I have interviewed and really enjoy hearing from. Jeff Jonas is a data scientist who now works for IBM. One of his jobs was designing casino security systems in Las Vegas, where he currently lives. He worked for the surveillance intelligence groups of several casinos and automated various manual processes, adding facial recognition software that was key to slowing down the MIT card counting group. "We built [another] system to immediately identify risk in real time so they could get these people out of the casino quickly." This software is still offered by IBM as its InfoSphere Identity Insight event processing and identity tracking technology.
Mason and others have mentioned the now-iconic Enron email archive, which has since passed into the public domain. A number of big data researchers use it to test their email algorithms, and it is available from a number of academic websites. It is a collection of actual emails that forms the basis of many anti-spam programs these days, which is ironic given that the emails have outlasted the company where everyone once worked.
Here we are looking at a facsimile of an old newspaper (you remember newspapers, right?). Ironically, it was called the New York Mirror. And while this and so many other newspapers have bitten the dust, one operation that is still in business is The Associated Press. If you are looking for large content repositories, you probably can't get much larger than the AP's article archive. The AP has launched a content analysis tool that searches the millions of articles in its archives to create custom archive products for its customers. The project uses a solution from MarkLogic, a major Big Data enabler that many different kinds of publishers, such as LexisNexis, use for this purpose. The AP didn't start out with MarkLogic; it first tried to implement a more traditional relational database structure, only to run into problems. Its archives are in XML, which made it difficult to design the right kind of data structures. Plus, it didn't have consistent metadata across the archives. The MarkLogic implementation took 16 weeks from start to finish and was the first time that the AP had used the company's services. It enables the AP to run complex Boolean searches across millions of articles in its content archive and get back precise results in seconds or minutes instead of days or weeks. This much quicker response time is already transforming the AP's B2B product offerings and helps it search unstructured content in near real-time. Users can query for particular keywords, and the AP can use the search query traffic to see trending topics and deliver article collections to particular B2B customers. For example, it could create references on a particular subject or moment in time.
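To show the idea behind Boolean search and trending queries, here is a minimal sketch using an in-memory inverted index. It is a toy under stated assumptions and not MarkLogic's API or the AP's implementation: the function names, the three sample headlines and the query log are all invented for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(articles):
    """Map each term to the set of article ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in articles.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, all_of=(), any_of=(), none_of=()):
    """Boolean search: AND over all_of, OR over any_of, NOT over none_of."""
    result = set().union(*index.values())  # start from every indexed article
    for term in all_of:
        result &= index.get(term.lower(), set())
    if any_of:
        result &= set().union(*(index.get(t.lower(), set()) for t in any_of))
    for term in none_of:
        result -= index.get(term.lower(), set())
    return result


def trending(query_log, n=3):
    """Return the n most frequent terms across a log of past queries."""
    return Counter(term.lower() for query in query_log for term in query).most_common(n)


# Hypothetical sample archive and query log.
ARTICLES = {
    1: "Senate passes budget bill",
    2: "Budget talks stall in Senate",
    3: "Hurricane hits Gulf coast",
}
QUERY_LOG = [["senate", "budget"], ["budget"], ["hurricane"]]

if __name__ == "__main__":
    index = build_index(ARTICLES)
    print(search(index, all_of=["senate", "budget"], none_of=["stall"]))
    print(trending(QUERY_LOG, n=1))
```

A production system adds ranking, metadata and persistence on top, but the set operations are the core of what a Boolean query does.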
One of my favorite Big Data hotbeds is Kaggle. It routinely hosts various big data contests, and this one, which concluded last month, was a way for Facebook to evaluate prospective employees. More than 400 people submitted entries.
Here are some of the local meetups if you want to learn more about Big Data.
Thanks everyone for listening to me and good luck with your own Big Data explorations.
Big Data Analytics
July 2015 Wash Univ.
Download this here:
Three necessary skills
• Strategic data planning. Understand how data
is the new raw material for any modern
• Analytical skills. What is the data trying to tell
• Technology skills. Embrace the technology
and make it a key part of your skill set.
• Tracking Twitter airline sentiment
• Using car-generated GPS data
• Analyzing maps
• What you can glean from your log files
• How P&G does it big-time
• Betting on Big Data with IBM
• The infamous Enron email data set
• Trends from AP’s news archive
Thanks for your ideas!
• Copies of this presentation:
• My blog: http://strominator.com
• Follow me on Twitter: @dstrom
• Old school: firstname.lastname@example.org