Random notes on big data


Published in: Education, Technology


  1. 1. Random notes on big data. Chen Peng, Jianqiang Wang, Yang Huang. April 19, 2013
  2. 2. What is big data
  3. 3. ● Volume: Gigabytes -> Terabytes -> Petabytes. ● Velocity: time sensitive, streaming, real-time. Jet engine: 20TB/hr GE: (minds + machines) ● Variety: structured/unstructured. ● Value: insights, analytical systems.
  4. 4. Challenges: collect, store, organize, analyze and share External > web sites (blogs/reviews) > social media (Facebook, LinkedIn, Google+, Twitter) > images and videos > ... Internal > transactions > server logs > machines and sensors > emails > ... Variety
  5. 5. Value Hierarchy Raw Data Normalized Insight Recommendation Transact Data is now a strategic asset
  6. 6. Technology stack & corresponding firms
  7. 7. Google Cloud Platform. Google App Engine: scalable application development and execution environment. Google Compute Engine: virtual machines; run arbitrary workloads at scale (e.g. Hadoop, scientific computing). Google Cloud Storage: storage; connecting glue between each step of the data pipeline. Google BigQuery: data analysis; querying large datasets + third party apps for visualization (e.g. Tableau).
  8. 8. Big data analytics. Analytics is the scientific process of transforming data into insights for making better decisions. Data ("resource": IT logs, cloud, social media, sensors, experiments, etc.) -> Insight ("product": statistical & operations research modeling) -> Decision ("goal": judgement, constraints, intuition)
  9. 9. Predictive analytics extracts information from data and uses it to predict future trends and behavior patterns. Models: regression models; discrete choice models; time series models; classification models (decision tree, random forest, support vector machine, neural network, etc.); clustering models (k-means, density based, graph based, etc.); association analysis; ... Big data analytics: Descriptive Analytics, Predictive Analytics, Prescriptive Analytics
  10. 10. Always keep in mind... > business objectives are the origin of every data mining solution > data preparation is more than half of the data mining process > all patterns are subject to change > there will always be new knowledge Always pause and ask yourself: Does this work relate to the business question we try to answer? Is the original business question still valid?
  11. 11. Industry Use-cases/Application Healthcare Utilities Retail & marketing Financial services Telecom Use cases by industry
  12. 12. Industry applications of big data analytics. Customer acquisition: predict customers' buying habits in order to promote relevant products at multiple touch points. http://www.youtube.com/watch?feature=player_embedded&v=3WspJ16Ubhw Clinical decision support: experts use predictive analysis in health care primarily to determine which patients are at risk of developing certain conditions, like diabetes, asthma, heart disease, and other lifetime illnesses. Cross sale: predictive analytics can help analyze customers' spending, usage and other behavior, leading to efficient cross sales, or selling additional products to current customers (beer & diaper). Ads targeting: http://www.slideshare.net/dennyglee/yahoo-tao-case-study-excerpt
  13. 13. Industry applications of big data analytics. Fraud detection: a predictive model can help weed out the "bads" and reduce a business's exposure to fraud. Image and Speech Recognition: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/MIT_BigData_Sep2012.pdf Operations: Jet Engine + Humans http://www.youtube.com/watch?v=JHc4ZTTWKrQ Amazon warehouse operational efficiency: http://www.youtube.com/watch?v=Kafs9tZskuo
  14. 14. Beer and diaper
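The beer-and-diaper story is the textbook illustration of association analysis, which the deck lists among its predictive models. As a minimal sketch, the three usual rule metrics (support, confidence, lift) can be computed as below. The baskets are invented toy data and the function names are hypothetical; nothing here comes from the original slides.

```python
# Toy market-basket data, invented purely for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent).

    Assumes the antecedent appears in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline frequency.

    A value above 1 suggests the items co-occur more often than chance.
    """
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
rule_lift = lift({"diapers"}, {"beer"}, baskets)
print(rule_support, rule_confidence, rule_lift)
```

On real transaction logs the same quantities are usually mined with an algorithm such as Apriori or FP-Growth instead of brute-force counting.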
  15. 15. What are those startups doing? Bloomreach http://www.youtube.com/watch?feature=player_embedded&v=K12awAj4tW8 Datastax http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?pagewanted=all Paraccel http://www.paraccel.com/solutions/paraccel-solutions-big-data.php#.UXG207WG3Ct Kaggle http://www.kaggle.com/c/acm-sf-chapter-hackathon-big
  16. 16. VC funding for "Big Data" Data from 71 start-ups. Funding is counted starting from 2004.
  17. 17. VC Funding Activity Data from 71 start-ups. Funding is counted starting from 2004.
  18. 18. Interesting view points "Special (domain) knowledge becomes less relevant; organizations should focus on collecting people who know how to extract value and insights from data." "In God we trust. All others must bring data." "The usefulness of a variable in a model is inversely related to the time you spend creating it." "Noise is convex but information is concave." "Big data is sexy but small data is beautiful." [Figure: noise and information vs. data size]
  19. 19. Interesting view points "All models are wrong, but some are useful." "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it; everyone thinks everyone else is doing it, so they claim they are doing it." "Statistics: The Art and Science of Learning from Data"
  20. 20. The danger of big data
  21. 21. Open discussion. Potential opportunities / challenges for entrepreneurs? - visualization - internet of things - analytics as a service (a3s) Standardization vs. customization. Human and data interaction - data vs. intuition
  22. 22. Back-Up Slides
  23. 23. Data Science vs. OR risk management strategic planning predictive analytics optimization Risk Measurable of Objective skill sets of data scientists
  24. 24. Big data types ● Web & social media: clickstream, web content, Amazon reviews, Facebook postings & 'like'... ● M2M: smart meters, oil rig sensor reading, GPS signals... ● Transaction: retail store, healthcare claims, utility billing... ● Biometrics: fingerprint, face, voice, handwriting... ● Human-generated data: call logs, emails, surveys...
  25. 25. Web & social media ● Transaction: orders, revenue, ● Conversion: click thru, convert to purchase,... ● Session: length, bounce rate ● Lifetime value: repeat, frequency,... ● Social interaction: intensity, influence,... Shopping cart analysis CTR prediction Personalization Retention/customer churn A/B testing Targeted ads Lifetime value
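Several of the session and conversion measures on this slide reduce to simple ratios over session records. The sketch below assumes a hypothetical `Session` record and treats a one-page session as a bounce; both are illustrative conventions, not definitions taken from the slides, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical per-visit record; the field names are assumptions."""
    page_views: int
    ad_impressions: int
    ad_clicks: int
    purchased: bool


def web_metrics(sessions):
    """Compute bounce rate, click-through rate and conversion rate."""
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    impressions = sum(s.ad_impressions for s in sessions)
    clicks = sum(s.ad_clicks for s in sessions)
    return {
        # Share of sessions that viewed exactly one page.
        "bounce_rate": sum(s.page_views == 1 for s in sessions) / n,
        # Clicks per ad impression (0.0 when no ads were shown).
        "ctr": clicks / impressions if impressions else 0.0,
        # Share of sessions that ended in a purchase.
        "conversion_rate": sum(s.purchased for s in sessions) / n,
    }


# Invented sample data for illustration.
sessions = [
    Session(page_views=1, ad_impressions=2, ad_clicks=0, purchased=False),
    Session(page_views=5, ad_impressions=10, ad_clicks=1, purchased=True),
    Session(page_views=3, ad_impressions=6, ad_clicks=0, purchased=False),
    Session(page_views=1, ad_impressions=2, ad_clicks=1, purchased=False),
]
metrics = web_metrics(sessions)
print(metrics)
```

An A/B test would compute the same ratios separately for each variant and then compare them with a significance test.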
  26. 26. Interesting data visualization projects wind map http://hint.fm/wind/gallery/oct-30.js.html
  27. 27. Some analytical problems people deal with at Google ... ● search ranking
  28. 28. Processing Pipeline: log, sensor, web, ... -> Hadoop MapReduce -> Structured Data -> SQL, HIVE, Dremel, ... -> Analytics. Note: Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. It originated from Google MapReduce and was further developed/promoted by Yahoo. Big Data Cloud Computing http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
  29. 29. How big is big? When your data set becomes so large that you have to start innovating around how to collect, store, organize, analyze and share it ... External > web sites (blogs/reviews) > social media (Facebook, LinkedIn, Google+, Twitter) > images and videos > ... Internal > transactions > server logs > machines and sensors > emails > ...
  30. 30. Health care Sentiment analysis Patient monitoring Genetic Testing Electronic Medical Records Utilities Smart Meters Retail Loyalty programs RFID tags Recommendation, market basket Face recognition Telcos Customer churn Location-based IT Machine log Web & Social media M2M Transaction Biometrics Human-generated
  31. 31. Example of semantic graph
  32. 32. Call Data Record
  33. 33. What is Hadoop
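The processing-pipeline slide describes Hadoop as growing out of Google's MapReduce. As a rough single-machine sketch of that programming model (map, then group by key, then reduce), a word count over log lines might look like this. It uses no Hadoop API; the function names and sample lines are hypothetical, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Invented sample log lines for illustration.
logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both phases can run in parallel on commodity hardware, which is the property the slide's note highlights.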