First NL-HUG: Large-scale data processing at SARA with Apache Hadoop

1. Large-scale data processing [at SARA] [with Apache Hadoop]. Evert Lammerts, February 9, 2012, Netherlands Hadoop User Group

2. Who's who?

5. An introduction to scale @ SARA

6. An introduction to Hadoop & MapReduce

7. Hadoop @ SARA

8. Why large-scale data processing? / An introduction to scale @ SARA / An introduction to Hadoop & MapReduce / Hadoop @ SARA

9. (Jimmy Lin, University of Maryland / Twitter, 2011)

10. (IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)

11. s/knowledge/data/g* HTTP logs, Click data, Query logs, CRM data, Financial data, Social networks, Archives, Crawls, and many more. You already have your data. (*Jimmy Lin, University of Maryland / Twitter, 2011)

13. Simple programming models

14. Easy-to-learn scripting

15. Anybody with the know-how can generate insights!

16. Note: “the know-how” = Data Science: DevOps, programming, algorithms, domain knowledge

18. SARA, the national center for scientific computing. Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization

19. Large-scale data != new

22. MIMD: Multiple Instruction Multiple Data

23. MISD: Multiple Instruction Single Data

24. SISD: Single Instruction Single Data (Von Neumann)

25. Parallelism: Amdahl's law
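Amdahl's law bounds the speedup of a job in which only a fraction p of the work can run in parallel: S(n) = 1 / ((1 - p) + p / n) for n workers. A minimal sketch follows; the function name and the 95% parallel fraction are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can be parallelized (0..1).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job that is 95% parallelizable (an assumed value).
    for n in (1, 8, 72, 576):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

However many workers are added, the speedup stays below 1 / (1 - p), which is why the serial part of a job dominates at scale.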

26. Data parallelism

27. Compute @ SARA

28. What's different about Hadoop? No more do-it-yourself parallelism – it's hard! But rather linearly scalable data parallelism. Separating the what from the how. (NYT, 14/06/2006)

30. A bit of history: Nutch* (2002), MR/GFS** (2004), Hadoop (2006). * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html

31. 2010 - 2012: A Hype in Production (http://wiki.apache.org/hadoop/PoweredBy)

33. Move processing to the data

34. Process data sequentially, avoid random reads

35. Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)

37. Extract something of interest

38. Create an ordering in intermediate results

39. Aggregate intermediate results

40. Generate output. MapReduce: functional abstraction of step 2 & step 4 (Jimmy Lin, University of Maryland / Twitter, 2011)

42. reduce (k', v') -> <k', v'>* All values associated with a single key are sent to the same reducer The framework handles the rest

43. The rest? Scheduling, data distribution, ordering, synchronization, error handling...
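The extract / order / aggregate steps can be sketched as a word count. This is a single-process, in-memory simulation, not the Hadoop API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping loop stands in for the shuffle and sort that the framework would perform across machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_fn(doc_id: int, text: str) -> Iterator[Tuple[str, int]]:
    """Extract something of interest: emit (word, 1) for every word."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: List[int]) -> Iterator[Tuple[str, int]]:
    """Aggregate intermediate results: sum all counts for one key."""
    yield word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[int, str]],
    mapper: Callable[[int, str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Iterable[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    # Map phase, plus grouping: all values for a key end up in one list,
    # mirroring "all values associated with a single key are sent to the
    # same reducer".
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Sort the keys (the ordering step), then reduce each group.
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


if __name__ == "__main__":
    docs = [(1, "move processing to the data"), (2, "the data is large")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

Only the two user-supplied functions describe the "what"; everything inside `run_mapreduce` is the "how" that a real cluster takes over.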

44. An overview of a Hadoop cluster

45. The ecosystem: HBase, Hive, Pig, HCatalog, Giraph, Elephantbird, and many others...

47. Timeline. 2009: Piloting Hadoop on Cloud. 2010: Test cluster available for scientists (6 machines * 4 cores / 24 TB storage / 16GB RAM; just me!). 2011: Funding granted for production service. 2012: Production cluster available (~March) (72 machines * 8 cores / 8 TB storage / 64GB RAM; integration with Kerberos for secure multi-tenancy; 3 devops, team of consultants).

48. Architecture

49. Components: Hadoop, Hive, Pig, HBase, HCatalog - others?

51. Natural Language Processing

52. Machine Learning

53. Econometrics

54. Bioinformatics

55. Computational Ecology / Ecoinformatics

56. Machine learning: Infrawatch, Hollandse Brug

57. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
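The product on this slide can be worked out directly. The sensor count and sample rate are from the slide; the 4 bytes per sample is an assumption made only to turn the sample count into a rough storage figure, since the talk does not state a sample size.

```python
SENSORS = 145                           # from the slide
SAMPLE_RATE_HZ = 100                    # from the slide
SECONDS_PER_YEAR = 60 * 60 * 24 * 365   # 60 s x 60 min x 24 h x 365 d
BYTES_PER_SAMPLE = 4                    # assumption: one 32-bit value per sample


def samples_per_year(sensors: int = SENSORS, rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of sensor readings produced in one (non-leap) year."""
    return sensors * rate_hz * SECONDS_PER_YEAR


def raw_bytes_per_year(bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Raw payload size per year, ignoring timestamps and file overhead."""
    return samples_per_year() * bytes_per_sample


if __name__ == "__main__":
    print(f"{samples_per_year():,} samples per year")
    print(f"{raw_bytes_per_year() / 1e12:.2f} TB per year at {BYTES_PER_SAMPLE} bytes/sample")
```

That comes to roughly 4.6 x 10^11 readings per year before any timestamps, metadata, or derived features are stored.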

59. e.g. Twitter gardenhose data

60. e.g. Wikipedia dumps

61. e.g. del.icio.us & flickr tags

62. Finding named entities: [person company place] names

63. Creating inverted indexes

64. Piloting real-time search

65. Personalization

66. Semantic web

67. Interest from industry. We're opening shop. Come and pilot.

70. Hadoop is probably not the best

71. Hadoop has momentum

73. The data center is your computer

74. Where is the data scientist? Much to learn & teach!

75. Any questions? [email_address] @eevrt @sara_nl
