Making Sense of Big data with Hadoop

  • We are a managed service AND a solution provider of elite database and System Administration skills in Oracle, MySQL and SQL Server
  • We want the data, the whole data and nothing but the data.
  • You can no longer just throw one database at the problem and expect it to solve all your problems. Different parts of the solution require different technologies. I’ll talk mostly about Hadoop.
  • Bad schema design is not big data. Using 8-year-old hardware is not big data. Not having a purging policy is not big data. Not configuring your database and operating system correctly is not big data. Poor data filtering is not big data either. Keep the data you need and use, in a way that you can actually use it. If doing this requires cutting-edge technology, excellent! But don’t tell me you need NoSQL because you don’t purge data and have un-optimized PL/SQL running on 10-year-old hardware.
  • We always wanted more data. We never wanted to have to aggregate and then delete old data. We knew we were missing details, subtleties, opportunities. But we had to, because we wanted better performance and couldn’t afford unlimited disks. With new technologies, more data is more feasible.
  • One of the main reasons for the explosion of data stored in the last few years is that many problems are easier to solve if you apply more data to them. Take the Netflix Challenge for example. Netflix challenged the AI community to improve the movie recommendations made by Netflix to its customers based on a database of ratings and viewing history. Teams that used the available data more extensively did better than teams that used more advanced algorithms on a smaller data set. More data also allows businesses to make better, more informed decisions. Why have focus groups to decide on a new store design, if you can re-design several stores and compare how customers proceeded through each store and how many left without buying? On-line stores make the process even easier. Modern businesses are becoming more scientific and metrics-driven, and rely less on “gut feeling” as the cost of running business experiments and measuring the results decreases.
  • Data also arrives in more forms and from more sources than ever. Some of these don’t fit into a relational database very well, and for some, the relational database does not have the right tools to process the data. One of Pythian’s customers analyzes social media sources and allows companies to find comments about their performance and service and respond to complaints via non-traditional customer support routes. Storing Facebook comments and blog posts in Oracle for later processing results in most of the data getting stored in BLOBs, where it is relatively difficult to manage. Most of the processing is done outside of Oracle using Natural Language Processing tools. So, why use Oracle for storage at all? Why not store and process the documents elsewhere and only store the ready-to-display results in Oracle?
  • Data, especially from outside sources, is not in perfect condition to be useful to your business. Not only does it need to be processed into useful formats, it also needs: filtering for potentially useful information (99% of everything is crap); statistical analysis (is this data significant?); integration with existing data; entity resolution (is “Oracle Corp” the same as “Oracle” and “Oracle Corporation”?); and de-duplication. Good processing and filtering of data can reduce the volume and variety of data. It is important to distinguish between true and accidental variety. This requires massive use of processing power. In a way, there is a trade-off between storage space and CPU: if you don’t invest CPU in filtering, de-duping and entity resolution, you’ll need more storage.
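The entity resolution and de-duplication steps mentioned in this note can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the suffix list, the function names and the sample mentions are all hypothetical, and real entity resolution usually needs fuzzy matching and reference data rather than suffix stripping.

```python
import re

# Hypothetical list of legal suffixes to strip; a real pipeline would need a longer one.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "ltd", "llc", "co"}


def canonical_name(raw):
    """Lowercase, replace punctuation with spaces, and strip trailing legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    while len(words) > 1 and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def resolve_entities(names):
    """Group raw names by canonical form; each group is treated as one entity."""
    groups = {}
    for name in names:
        groups.setdefault(canonical_name(name), []).append(name)
    return groups


# Made-up mentions, based on the "Oracle Corp" question in the note above.
mentions = ["Oracle Corp", "Oracle", "Oracle Corporation", "oracle corp.", "Pythian"]
entities = resolve_entities(mentions)
for entity, raw_names in entities.items():
    print(entity, raw_names)
```

Five raw mentions collapse into two entities here, which is the storage-for-CPU trade-off in miniature: the normalization costs processing time up front and reduces what has to be stored and joined later.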
  • Data warehouses require the data to be structured in a certain way, and it has to be structured that way before the data gets into the data warehouse. This means that we need to know all the questions we would like to answer with this data when designing the schema for the data warehouse. This works very well in many cases, but sometimes there are issues: the raw data is not relational (images, video, text) and we want to keep the raw data for future use, or the requirements from the business change frequently. In these cases it is better to store the data and create patterns from it as it is parsed and processed. This allows the business to move from large up-front design to just-in-time processing. For example: the Astrometry project searches Flickr for photos of the night sky, identifies the part of the sky each photo is from and the prominent celestial bodies in it, and creates a standard database of the position of elements in the sky.
  • The new volume of data, and the need to transform it, filter it and clean it up, require: not only more storage, but also faster access rates; reliable storage, because we want high availability and resilient systems; access to as many cores as you can get, to process all this data; cores that are as close to the data as possible, to avoid moving large amounts of data over the network; and an architecture that allows many of the cores to be used in parallel for data processing.
  • Hadoop is the most common solution for the new Big Data requirements. It’s a scalable distributed file system, and a distributed job processing system on top of the file system. It is a PLATFORM, not a solution, so Hadoop is unlikely to make your life easier. A lot of querying and processing tasks are more difficult with Hadoop than without. But it makes previously expensive things cheaper, and previously impossible things possible. This lets companies keep massive amounts of unstructured data and process it efficiently. The assumption behind Hadoop is that most jobs will want to scan entire data sets, not specific rows or columns. So efficient access to specific data is not a core capability. Hadoop is open source, and there is a large ecosystem of tools, products and appliances built around it. This includes open source tools that make data processing on Hadoop easier and more accessible, BI and integration products, improved implementations of Hadoop that are faster or more reliable, Hadoop cloud services and hardware appliances.
  • There is growing demand for real-time analytics, and for serving data processed by Hadoop to customers with very low latency.
  • The exact balance depends on your workload: more CPU heavy? Just lots of data? Lots of disk bandwidth? More nodes mean cheaper scalability and more resilience, but also a higher cost of administration.
  • Modern data centers generate huge amounts of logs from applications and web services. These logs contain very specific information about how users are using our application and how the application performs. Hadoop is often used to answer questions like: How many users use each feature in my site? Which page do users usually go to after visiting page X? Do people return more often to my site after I made the new changes? What use patterns correlate with people who eventually buy a product? What is the correlation between slow performance and purchase rates? Note that the web logs can be processed, loaded into an RDBMS and parsed there. However, we are talking about very large amounts of data, and each piece of data needs to be read just once to answer each question. There are very few relations there. Why bother loading all this into an RDBMS?
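To make one of these questions concrete, here is a small single-machine Python sketch of “which page do users usually go to after visiting page X?”. The log format (timestamp, user, page) and the sample lines are assumptions made up for the illustration; on Hadoop the same scan-and-count logic would be spread over many nodes.

```python
from collections import Counter, defaultdict

# Hypothetical simplified log format: "timestamp user_id page"
LOG_LINES = [
    "1 alice /home",
    "2 bob /home",
    "3 alice /search",
    "4 bob /search",
    "5 alice /product",
    "6 bob /home",
]


def next_page_counts(lines):
    """Count, for every page, which page each user visited immediately afterwards."""
    visits = defaultdict(list)
    for line in lines:
        timestamp, user, page = line.split()
        visits[user].append((int(timestamp), page))

    transitions = defaultdict(Counter)
    for events in visits.values():
        pages = [page for _, page in sorted(events)]  # order each user's visits by time
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


transitions = next_page_counts(LOG_LINES)
most_common_after_home = transitions["/home"].most_common(1)[0]
print(most_common_after_home)
```

Every log line is read exactly once and nothing is joined against anything else, which is why loading this data into an RDBMS first adds little.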
  • Hadoop has large storage, high bandwidth and lots of cores, and was built for data aggregation. Also, it is cheap. Data is dumped from the OLTP database (Oracle or MySQL) to Hadoop. Transformation code is written on Hadoop to aggregate the data (this is the tricky part) and the data is loaded into the data warehouse (usually Oracle). This is such a common use case that Oracle built an appliance especially for this.
  • A lot of the modern web experience revolves around websites being able to predict what you’ll do next or what you’d like to do but don’t know about yet: people you may know; jobs you may be interested in; other customers who looked at this product eventually bought…; these emails are more important than others. To generate this information, usage patterns are extracted from OLTP databases and logs, the data is analyzed, and the results are loaded into an OLTP database again for use by the customer. The analysis task started out as a daily batch job, but soon users expected more immediate feedback. More processing resources were brought in to speed up the process. Then the system started incorporating customer feedback into the analysis when making new recommendations. This new information needed more storage and more processing power.
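A very small Python sketch of the “other customers who looked at this product eventually bought…” analysis follows. The event tuples and product names are made up, and the sketch ignores the order of events and any weighting; it only counts co-occurrence per customer, which is one simple way to get such a list and not necessarily what any particular site does.

```python
from collections import Counter, defaultdict

# Hypothetical events extracted from OLTP tables and logs: (customer, action, product)
EVENTS = [
    ("c1", "view", "tent"), ("c1", "buy", "sleeping bag"),
    ("c2", "view", "tent"), ("c2", "buy", "sleeping bag"),
    ("c3", "view", "tent"), ("c3", "buy", "stove"),
    ("c4", "view", "stove"), ("c4", "buy", "stove"),
]


def viewed_then_bought(events):
    """For each viewed product, count what the same customers bought."""
    viewed = defaultdict(set)
    bought = defaultdict(set)
    for customer, action, product in events:
        if action == "view":
            viewed[customer].add(product)
        elif action == "buy":
            bought[customer].add(product)

    counts = defaultdict(Counter)
    for customer, products in viewed.items():
        for seen in products:
            for purchased in bought.get(customer, ()):
                counts[seen][purchased] += 1
    return counts


recommendations = viewed_then_bought(EVENTS)
top_for_tent = recommendations["tent"].most_common(1)[0]
print(top_for_tent)
```

The resulting counts are the kind of small, ready-to-serve result that gets loaded back into the OLTP database, while the raw events stay on the cluster.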
  • One customer uses Twitter as input for customer support. They search Twitter and save all feeds that mention their customers (say, AT&T) to their Hadoop cluster, then mine them for relevant information (user, location, what the problem is, how popular the user is). They then open tickets in a traditional customer support system based on this information. They can also go back and mine all the saved data for recurring complaints, problem areas, etc. Another use case is analysts and marketing departments who mine blogs and job postings to find trending topics.
  • Start with a well-defined and obviously important requirement.
  • Oracle’s Big Data machine was built to move data between Oracle RDBMS and Hadoop fast, and I doubt anyone can beat Oracle at that. Both the tools that are bundled with the machine and the fast IB connection to Exadata make it very attractive for businesses wishing to use Hadoop as an ETL solution. Note that the tools should also be avba

Transcript

  • 1. Making Sense of BIG DATA with Hadoop
  • 2. ● 13 years with a pager ● Oracle ACE Director ● Oak table member ● Senior consultant for Pythian ● @gwenshap ● http://www.pythian.com/news/author/shapira/ ● shapira@pythian.com © 2012 Pythian
  • 3. Pythian. Recognized Leader: • Global industry-leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and Microsoft SQL Server • Work with over 165 multinational companies such as LinkShare Corporation, IGN Entertainment, CrowdTwist, TinyCo and Western Union to help manage their complex IT deployments. Expertise: • One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 7 Oracle ACEs/ACE Directors. Heavily involved in the MySQL community, driving the MySQL Professionals Group and sit on the IOUG Advisory Board for MySQL. • Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate & Oracle RAC. Global Reach & Scalability: • 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response
  • 4. What is Big Data?
  • 5. MORE DATA THAN YOU CAN HANDLE
  • 6. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE
  • 7. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY
  • 8. Data Arriving at Fast Rates. Typically Unstructured. Stored without Aggregation. Analyzed in Real Time. For Reasonable Cost.
  • 9. Complex Data Architecture
  • 10. Your Data is NOT as BIG as you think
  • 11. Why Big Data? Why Hadoop?
  • 12. BECAUSE WE CAN
  • 13. More Data Beats Smarter Algorithms
  • 14. Email, Photos, Job posting, Tweets, Video, Medical imaging, Sensors, Blog posts, Tags, Scanned docs
  • 15. Data is Messy
  • 16. An Imperial College Team found: • 3,000 patients under 19 were treated in geriatric clinics • between 15,000 and 20,000 men have been admitted to obstetric wards • and almost 10,000 to gynecology wards. http://www.straightstatistics.org/blog/2012/04/06/why-are-so-many-men-pregnant
  • 17. Unstructured Eventually Structured Data
  • 18. Scalable Storage + Massive Parallel Processing + Reasonable Cost
  • 19. Hadoop: Platform for distributed computing
  • 20. Hadoop is Scalable. But not fast.
  • 21. Much Ado about Hadoop
  • 22. Assumptions • Lots of data • Large Files • Unstructured • Scan entire files • Unreliable Hardware • Adding servers = increase capacity
  • 23. Principles • Bring Code to Data • Share Nothing
  • 24. HDFS • Distributed • Replicated • Big Files • Write Once • Read Entire File
  • 25. /users/shapira/log-1, blocks {1,4,5}; /users/shapira/log-2, blocks {2,3,6} [Diagram: each of the six blocks is stored three times across the data nodes]
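The block layout on this slide can be imitated with a short Python sketch. The two file-to-block mappings come from the slide; the node names, the number of nodes and the round-robin placement rule are assumptions for illustration only. Real HDFS chooses replica locations with its own rack-aware policy, which this does not reproduce.

```python
# File-to-block mapping as shown on the slide.
FILES = {
    "/users/shapira/log-1": [1, 4, 5],
    "/users/shapira/log-2": [2, 3, 6],
}
REPLICATION = 3  # three copies per block, as in the slide's diagram
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical data nodes


def place_blocks(files, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (not the HDFS policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    next_node = 0
    for path in sorted(files):
        for block in files[path]:
            placement[block] = [
                nodes[(next_node + i) % len(nodes)] for i in range(replication)
            ]
            next_node += 1
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


placement = place_blocks(FILES, NODES, REPLICATION)
for block, replicas in sorted(placement.items()):
    print(block, replicas)
```

Because every block lives on three different nodes, losing any single node leaves both files fully readable, which is what makes the “Unreliable Hardware” assumption workable.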
  • 26. [Diagram: a Hadoop job running Job 1 and Job 2; each job goes Start, Map, Combine, Reduce (optional), Stop, and the final output is Results]
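The Map and Reduce steps in this diagram can be simulated on one machine in plain Python. This is a sketch of the programming model only, not of Hadoop’s API: the record format (“user,feature”) and all function names are made up, and the Combine step is left out for brevity.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (feature, 1) pair for one "user,feature" log record."""
    _user, feature = record.split(",")
    yield feature, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on a single machine."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


records = ["alice,search", "bob,search", "alice,checkout"]  # made-up sample input
result = run_job(records)
print(result)
```

Each map call looks at one record and each reduce call at one key, so both phases can be split across many nodes without the workers sharing any state.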
  • 27. Implementation • Balance disks, cores and RAM • High Bandwidth • More nodes or better nodes?
  • 28. It’s about the Ecosystem • Sqoop • Flume • Hive • Pig • HBase
  • 29. Use Cases
  • 30. Use Case: Log processing
  • 31. Use Case: ETL [Diagram labels: OLTP, DWH, BI]
  • 32. Use Case: Recommendations
  • 33. Use Case: Listening to the crowd
  • 34. Our customers use Hadoop for: • Storing lots of pre-processed data • Merging different data types • Scalable data processing • Advanced data processing
  • 35. Big Data in your Company
  • 36. Easy case: Your CTO heard about Big Data and is eager to invest. You have a Big Budget.
  • 37. [Diagram labels: Require, Measure, Acquire, Serve, Organize, Analyze]
  • 38. [Diagram: the same stages annotated with tools; labels as extracted: Require, Hadoop, Measure, NoSQL, OLTP, BI, NoSQL, RDMB, Oracle, Hadoop, BI, R]
  • 39. Data Scientist = Sneaky BI. Disregards Silos. Cool Toys.
  • 40. Mining Tools: • Machine Learning • Cluster Detection • Regression • Graph Analysis • Visualization
  • 41. http://nicolasrapp.com/?p=1118
  • 42. http://www.orgnet.com/slumlords.html
  • 43. Want to do more with your data? Don’t know where to start? No budget? No problem!
  • 44. Sneak Hadoop to Your Business • Find an important business problem • Acquire data (be sneaky!) • Get the tools: R, Hadoop, Tableau • Laptops, desktops, test servers • Analyze data • Make pretty charts • Get business used to it • Wait for an Outage • PROFIT!
  • 45. Oracle Big Data: The “ETL Machine”
  • 46. Hardware: 18 servers, 216 cores, 864G RAM, 648T disks, Infiniband
  • 47. Software: Oracle NoSQL, Cloudera Hadoop Distribution, Oracle Loader for Hadoop, Data Integrator for Hadoop, Direct Connector for Hadoop, Oracle Connector for R
  • 48. Cores, Storage, Infiniband and Software Makes Oracle Big Data The Ultimate ETL Machine
  • 49. Thank you & Q&A. To contact us… sales@pythian.com 1-866-PYTHIAN. To follow us… http://www.pythian.com/news/ http://www.facebook.com/pages/The-Pythian-Group/ http://twitter.com/pythian http://www.linkedin.com/company/pythian