The other Apache technologies your big data solution needs!



An overview of the various Apache Technologies to help you build your own Big Data solution.

A talk given at Berlin Buzzwords in June 2013.



  1. The other Apache Technologies your Big Data solution needs! (Nick Burch, CTO, Quanticate)
  2. Copyright © Quanticate 2013 (Nick Burch, CTO, Quanticate)
  3. The Apache Software Foundation
     - Apache Technologies as in the ASF
     - 117 Top Level Projects
     - 37 Incubating Projects (> 100 past ones)
     - Y is the only letter we lack; A, C, O, S & T are popular letters, >= 10 each
     - Meritocratic, Community driven, Open Source
  4. A Growing Foundation...
     - June 2013: 117 Top Level Projects, 37 Incubating Projects
     - June 2012: 104 Top Level Projects, 50 Incubator Projects
     - June 2011: 91 Top Level Projects, 59 Incubating Projects
  5. What we're not covering
  6. Projects not being covered
     - Cassandra – cassandra.apache.org
     - CouchDB – couchdb.apache.org
     - Drill (Incubating) – incubator.apache.org/drill
     - Giraph – giraph.apache.org
     - Hadoop – hadoop.apache.org
     - Hive – hive.apache.org
     - HBase – hbase.apache.org
  7. Projects not being covered, part 2
     - Lucene – lucene.apache.org
     - Solr – lucene.apache.org/solr
     - Lucy – lucy.apache.org
     - Mahout – mahout.apache.org
     - Nutch – nutch.apache.org
     - OODT – oodt.apache.org
     - All covered in other sessions
  8. What we are looking at
  9. Talk Structure
     - Loading and querying Big Data
     - Building your MapReduce Jobs
     - Deploying and Building for the Cloud
     - Servers for Big Data
     - Building out your solution
     - Many projects to cover – this is only an overview!
  10. Audience Participation: A show of hands...
  11. Making the most of the talk
      - Slides will be available after the talk – see @Gagravarr for the link
      - Lots of information and projects to cover
      - Just an overview, see project sites for more
      - Try to take note / remember the projects you think will matter to you
      - Don't try to take notes on everything – there's too much!
  12. Loading and Querying
  13. Pig – pig.apache.org
      - Originally from Yahoo; entered the Incubator in 2007, graduated 2008; very widely used
      - Provides an easy way to query data, which is compiled into Hadoop M/R (not SQL-like)
      - Typically 1/20th of the lines of code, and 1/15th of the development time
      - Optimising compiler – often only slightly slower, occasionally faster!
  14. Pig – pig.apache.org
      - Shell, scripting and embedded Java
      - Local mode for development
      - Built-ins for loading, filtering, joining, processing, sorting and saving
      - User Defined Functions too
      - Similar range of operations as SQL, but quicker and easier to learn
      - Allows non-coders to easily query
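      To make the "easy to query" point concrete, here is the classic word-count example in Pig Latin (the input and output paths are illustrative); the equivalent hand-written Hadoop M/R job needs considerably more Java:

      ```pig
      -- Load raw lines, split them into words, then count each word
      lines   = LOAD 'input/logs.txt' AS (line:chararray);
      words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      grouped = GROUP words BY word;
      counts  = FOREACH grouped GENERATE group, COUNT(words);
      STORE counts INTO 'output/wordcounts';
      ```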
  15. Gora – gora.apache.org
      - ORM Framework for (NoSQL) Column Stores
      - Grew out of the Nutch project
      - Supports HBase, Cassandra, Hypertable
      - K/V: Voldemort, Redis
      - HDFS flat files, plus basic SQL
      - Data is stored in Avro (more later)
      - Query with Pig, Lucene, Hive, Cascading, Hadoop M/R, or native Store code
  16. Gora – gora.apache.org
      Example: Web Server Log – Avro data bean, JSON

      ```json
      {
        "type": "record",
        "name": "Pageview",
        "namespace": "org.apache.gora.tutorial.log.generated",
        "fields": [
          {"name": "url", "type": "string"},
          {"name": "timestamp", "type": "long"},
          {"name": "ip", "type": "string"},
          {"name": "httpMethod", "type": "string"},
          {"name": "httpStatusCode", "type": "int"},
          {"name": "responseSize", "type": "int"},
          {"name": "referrer", "type": "string"},
          {"name": "userAgent", "type": "string"}
        ]
      }
      ```
  17. Gora

      ```java
      // ID is a long, Pageview is the compiled Avro bean
      dataStore = DataStoreFactory.getDataStore(Long.class, Pageview.class);

      // Parse the log file, and store
      while (going) {
          Pageview page = parseLine(reader.readLine());
          dataStore.put(logFileId, page);
      }
      dataStore.close();

      private Pageview parseLine(String line) throws ParseException {
          StringTokenizer matcher = new StringTokenizer(line);
          // parse the log line
          String ip = matcher.nextToken();
          ...
          // construct and return pageview object
          Pageview pageview = new Pageview();
          pageview.setIp(new Utf8(ip));
          pageview.setTimestamp(timestamp);
          ...
          return pageview;
      }
      ```
  18. Accumulo – accumulo.apache.org
      - Distributed Key/Value store, built on top of Hadoop, Zookeeper and Thrift
      - Inspired by BigTable, with some improvements to the design
      - Cell level permissioning (access labels), write constraints, and server side hooks to tweak data as it's read/written
      - Supports very large rows
      - Initial work mostly done by the NSA!
  19. Accumulo – accumulo.apache.org
      - High availability (uses Zookeeper)
      - Distributed WAL via loggers, later to HDFS
      - Logical time – tablets can have wrong dates!
      - FATE – Fault Tolerant Executions
      - Testing friendly! Mocks, functional tests, scaling tests, and random walk tests
      - Admin support OOTB
  20. Sqoop – sqoop.apache.org (no u!)
      - Bulk data transfer tool for big data systems
      - Hadoop (HDFS), HBase and Hive on one side; SQL databases on the other
      - Can be used to import existing data into your big data cluster
      - Or, export the results of a big data job out to your data warehouse
      - Generates code to work with data
  21. Chukwa (Incubating)
      - Log collection and analysis framework based on Hadoop
      - Incubating since 2010
      - Collects and aggregates logs from many different machines
      - Stores data in HDFS, in chunks that are both HDFS and Hadoop friendly
      - Lets you dump, query and analyze
  22. Chukwa (Incubating)
      - Chukwa agent runs on source nodes
      - Collects from Log4j, Syslog, plain text log files etc
      - Agent sends to a Collector on the Hadoop cluster
      - Collector can transform if needed
      - Data written to HDFS, and optionally to HBase (needed for the visualiser)
  23. Chukwa (Incubating)
      - Map/Reduce and Pig query the HDFS files, and/or the HBase store
      - Can do M/R anomaly detection
      - Can integrate with Hive
      - eg Netflix collects weblogs with Chukwa, transforms with Thrift, and stores in HDFS ready for Hive queries
  24. Flume – flume.apache.org
      - Another log collection framework
      - Concentrates on rapidly collecting, aggregating and moving large amounts of logs from many sources to a central place
      - Typically write to HDFS + Hive + FTS
      - Sources, Channels and Sinks
      - Source consumes events, eg log file entries, avro data
  25. Flume – flume.apache.org
      - Source places events into one or more channels, which can be persistent or in-memory
      - Sink pulls events from a channel, then stores them
      - Can chain together, fan-in or fan-out
      - Data and Control planes independent
      - Does more OOTB than Chukwa
      - Adding a source + sink is easy; less scope to customise otherwise
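      The source / channel / sink wiring above is declared in an agent properties file; a minimal sketch (the agent name, log file and HDFS path are made up for illustration):

      ```properties
      # One agent with a single exec source, memory channel and HDFS sink
      agent1.sources  = src1
      agent1.channels = ch1
      agent1.sinks    = sink1

      # Source: tail a log file, turning each line into an event
      agent1.sources.src1.type = exec
      agent1.sources.src1.command = tail -F /var/log/app/app.log
      agent1.sources.src1.channels = ch1

      # Channel: in-memory (fast, but not persistent)
      agent1.channels.ch1.type = memory
      agent1.channels.ch1.capacity = 10000

      # Sink: write the events out to HDFS
      agent1.sinks.sink1.type = hdfs
      agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/
      agent1.sinks.sink1.channel = ch1
      ```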
  26. Flume – flume.apache.org
      - Another log collection framework
      - Concentrates on rapidly getting data from a variety of sources
      - Typically write to HDFS + Hive + FTS
      - Joint Agent + Collector model
      - Data and Control planes independent
      - More OOTB, less scope to alter
  27. Blur (Incubating)
      - Big Data search for structured data
      - Hadoop + HDFS + Lucene + Thrift + Zookeeper
      - Map/Reduce to process into thrift data + index
      - Blur distributes index shards across the cluster
      - Multiple controllers for fault tolerance
      - Automatic shard failover
      - Web based query interface
  28. Falcon (Incubating)
      - Data LifeCycle Management for Hadoop
      - Provides support to applications to implement the key parts – import, export, cleaning, duplication, recovery, retention etc
      - Declarative definitions; Falcon does deps etc
      - Controlled replication, for example only outputs of certain stages in the flow
      - Lots of tests for the key parts
  29. Hama – hama.apache.org
      - Bulk Synchronous Parallel (BSP) on HDFS
      - Massive scale scientific computations, eg Matrix, Graph and Network algorithms
      - BSP can be better than M/R or MPI
      - Message Passing based applications
      - No deadlocks or collisions
      - Flexible, simple API
      - High performance
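      BSP's core idea – peers compute locally, exchange messages, then all synchronise at a barrier before the next superstep – can be sketched in a few lines of plain Python (this illustrates the model only, not Hama's actual API):

      ```python
      def superstep(peers, compute):
          """One BSP superstep: every peer runs compute(), then messages
          are delivered at the barrier, visible only in the next superstep."""
          outgoing = {pid: compute(pid, state) for pid, state in peers.items()}
          for state in peers.values():
              state["inbox"] = []           # last superstep's messages are consumed
          for messages in outgoing.values():
              for dest, value in messages:  # barrier: deliver everything at once
                  peers[dest]["inbox"].append(value)

      # Three peers, each holding a partial value
      peers = {pid: {"value": v, "inbox": []} for pid, v in [(0, 3), (1, 4), (2, 5)]}

      # Superstep 1: every peer sends its local value to peer 0
      superstep(peers, lambda pid, state: [(0, state["value"])])

      # Next superstep: peer 0 aggregates what arrived at the barrier
      total = sum(peers[0]["inbox"])
      print(total)  # 12
      ```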
  30. MRQL – mrql.apache.org
      - MRQL, pronounced as "Miracle"
      - Large scale data analytics engine
      - Works on top of Hadoop or Hama
      - SQL-like query language
      - Supports more complex types and constructs than Pig or Hive, so you can avoid having to write explicit M/R code
      - Page Rank, k-means, matrix etc
  31. Tajo (Incubating)
      - Data warehouse system for Hadoop
      - Low-latency and scalable ad-hoc queries
      - HDFS for storage, custom query engine which controls execution, optimisation and data flow
      - Runs SQL queries, inc filter, group, sort, join
      - Some ETL for data format translation
      - Java API to submit queries; command line tools
  32. Building MapReduce Jobs
  33. Avro – avro.apache.org
      - Language neutral data serialization
      - Rich data structures (JSON based)
      - Compact and fast binary data format
      - Code generation optional for dynamic languages
      - Supports RPC
      - Data includes schema details
  34. Avro – avro.apache.org
      - Schema is always present – allows dynamic typing and smaller sizes
      - Java, C, C++, C#, Python, Perl, Ruby, PHP
      - Different languages can transparently talk to each other, and make RPC calls to each other
      - Often faster than Thrift and ProtoBuf
      - No streaming support though
  35. Thrift – thrift.apache.org
      - Java, C++, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, JS, Cocoa, OCaml and more!
      - From Facebook, at Apache since 2008
      - Rich data structures, compiled down into suitable code
      - RPC support too
      - Streaming is available
      - Worth reading the White Paper!
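      As a flavour of Thrift's interface definition language, here is a small made-up service definition; the thrift compiler turns this into client and server stubs in any of the supported languages:

      ```thrift
      struct LogEvent {
        1: string host,
        2: i64    timestamp,
        3: string message,
      }

      service LogService {
        // One-way fire-and-forget call; no response is sent
        oneway void log(1: LogEvent event),

        // Normal two-way RPC call
        i32 countEvents(1: string host),
      }
      ```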
  36. HCatalog (Just joined Hive)
      - Provides a table like structure on top of HDFS files, with friendly addressing
      - Allows Pig, Hadoop MR jobs etc to easily read/write structured data
      - Simpler, lighter weight than Avro or Thrift based serialisation
      - Based on Hive's metastore format
      - Doesn't require an additional datastore
  37. MRUnit – mrunit.apache.org
      - Built on top of JUnit
      - Checks Map, Reduce, then combined
      - Provides test drivers for Hadoop
      - Avoids you needing lots of boilerplate code to start/stop Hadoop
      - Avoids brittle mock objects
      - Handles multiple input K/Vs
      - Counter checking
  38. MRUnit
      IdentityMapper – same input/output

      ```java
      public class TestExample extends TestCase {
          private Mapper mapper;
          private MapDriver driver;

          @Before
          public void setUp() {
              mapper = new IdentityMapper();
              driver = new MapDriver(mapper);
          }

          @Test
          public void testIdentityMapper() {
              // Pass in { "foo", "bar" }, ensure it comes back again
              driver.withInput(new Text("foo"), new Text("bar"))
                    .withOutput(new Text("foo"), new Text("bar"))
                    .runTest();
              assertEquals(1, driver.getCounters().findCounter("foo", "bar").getValue());
          }
      }
      ```
  39. Hadoop Development Tools (Incubating)
      - Set of Eclipse plugins for Hadoop
      - Work with multiple versions of Hadoop from within the same Eclipse
      - Quite a new project, a lot still "TODO"
      - Makes it easy to build Hadoop programs
      - Query the cluster
      - Submit jobs, then track them
      - Remote debug jobs
  40. Oozie – oozie.apache.org
      - Workflow, scheduler and dependency manager for Hadoop jobs (inc Pig etc)
      - Define a workflow to describe the data flow for your desired output
      - Oozie handles running dependencies as needed, and scheduled execution of steps as requested
      - Builds up a data pipe, then executes as required on a cloud scale
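      Oozie workflows are declared in XML; a minimal sketch of a workflow running a single Pig script (the names, script path and schema version here are illustrative):

      ```xml
      <workflow-app name="log-pipeline" xmlns="uri:oozie:workflow:0.2">
        <start to="pig-step"/>

        <!-- A single action: run a Pig script, then succeed or fail -->
        <action name="pig-step">
          <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>wordcount.pig</script>
          </pig>
          <ok to="end"/>
          <error to="fail"/>
        </action>

        <kill name="fail">
          <message>Pig step failed</message>
        </kill>
        <end name="end"/>
      </workflow-app>
      ```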
  41. BigTop – bigtop.apache.org
      - Build, package and test code built on top of Hadoop related projects
      - Allows you to check that a given mix of, say, HDFS, Hadoop Core and ZooKeeper work well together
      - Then package a tested bundle
      - Integration testing of your stack
      - Test upgrades, generate packages
  42. Ambari (Incubating)
      - Monitoring, admin and lifecycle management for Hadoop clusters
      - eg HBase, HDFS, Hive, Pig, ZooKeeper
      - Deploy + configure stack to a cluster of machines
      - Update software stack versions
      - Monitoring and Service Admin
      - REST APIs for cluster management
  43. For the Cloud
  44. Provider Independent Cloud APIs
      - Lets you provision, manage and query Cloud services, without vendor lock-in
      - Translates general calls to the specific (often proprietary) ones for a given cloud provider
      - Work with remote and local cloud providers (almost) transparently
  45. Provider Independent Cloud APIs
      - Create, stop, start, reboot and destroy instances
      - Control what's run on new instances
      - List active instances
      - Fetch available and active profiles
      - EC2, Eucalyptus, Rackspace, RHEV, vSphere, Linode, OpenStack
  46. LibCloud – libcloud.apache.org
      - Python library (limited Java support)
      - Very wide range of providers
      - Script your cloud services

      ```python
      from libcloud.compute.types import Provider
      from libcloud.compute.providers import get_driver

      EC2_ACCESS_ID = 'your access id'
      EC2_SECRET_KEY = 'your secret key'

      Driver = get_driver(Provider.EC2)
      conn = Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)

      nodes = conn.list_nodes()
      # [<Node: uuid=..., state=3, public_ip=[], provider=EC2 ...>, ...]
      ```
  47. DeltaCloud – deltacloud.apache.org
      - REST API (XML) + web portal
      - Major Cloud Providers, RHEV-M, vSphere

      ```xml
      <instances>
        <instance href="" id="inst1">
          <owner_id>larry</owner_id>
          <name>Production JBoss Instance</name>
          <image href=""/>
          <hardware_profile href=""/>
          <realm href=""/>
          <state>RUNNING</state>
          <actions>
            <link rel="reboot" href=""/>
            <link rel="stop" href=""/>
          </actions>
          <public_addresses><address></address></public_addresses>
          <private_addresses><address>inst1.larry.internal</address></private_addresses>
        </instance>
      </instances>
      ```
  48. Provisionr (Incubating)
      - Higher level than DeltaCloud or LibCloud
      - Semi-automated workflows to provision: "Build me a cluster on these assumptions"
      - Cloud provider independent
      - Builds on Puppet, lets you work at a higher level
      - Handles retries, throttling etc
      - Can work on multiple clouds
  49. CloudStack – cloudstack.apache.org
      - Software to run an IaaS cloud computing platform (on demand, elastic etc)
      - Can run public, private or hybrid clouds
      - Entire stack to run an IaaS
      - Hypervisors: VMware, KVM, XenServer, XCP
      - Web interface, CLI or RESTful API to manage
      - API compatible with EC2 for hybrid
  50. Servers for Big Data
  51. ZooKeeper – zookeeper.apache.org
      - Centralised service for configuration, naming and synchronisation
      - Provides Consensus, Group Management and Presence tracking
      - Single co-ordination service across all the different components
      - ZooKeeper is distributed and highly reliable (avoid config being a SPOF)
  52. ZooKeeper – zookeeper.apache.org
      - "A central nervous system for distributed applications and services"
      - Bindings for Java, C, Perl, Python, Scala, .Net (C#), Node.js, Erlang
      - Applications can read nodes, send events, and watch for them
      - eg fetch config, come up, perform leader election, share active list
  53. Curator (Incubating)
      - "A ZooKeeper Keeper"
      - Provides implementations of common things you want to do using ZooKeeper
      - Improved API on top of ZooKeeper, eg with retries, error cases, cleaner API
      - Test support – in-memory ZK server, cluster
      - Provides SOA service discovery built using ZooKeeper
  54. Airavata – airavata.apache.org
      - Toolkit to build, manage, execute and monitor large scale applications
      - Workflow driven processing
      - Historically aimed at Scientific Processing, but now expanding
      - Facilities for developers to deploy
      - End user launches, workflow manages, then able to monitor eg via widgets
  55. Whirr – whirr.apache.org
      - Grew out of Hadoop
      - Aimed at running Hadoop, Cassandra, HBase and ZooKeeper
      - Higher level – running services, not running machines
      - Java (jclouds) and Python versions
      - Can be run on the command line
  56. ACE – ace.apache.org
      - Software Distribution Framework
      - Built on OSGi, but can work for anything
      - "Puppet for OSGi"
      - Tracks and installs components and deps
      - Group components, set targets to deploy to
      - Can do QA, staging, production flow
      - Web based management
  57. River – river.apache.org
      - Toolkit for building distributed systems
      - Communicate over JERI, which works over TCP, SSL, HTTP, HTTPS, Kerberos, JRMP
      - Can do dynamic discovery
      - Uses JINI for naming/addressing
      - Not as fully featured as ZooKeeper
      - Better for JINI types?
  58. Helix (Incubating)
      - Cluster Management Framework
      - State machine based; describe in terms of transitions and states
      - For partitioned, replicated and distributed resources across clusters of nodes
      - Sets up the nodes in the cluster
      - System tasks, eg maintenance, backups
      - "Global brain for the system"
  59. Knox (Incubating)
      - Secure gateway to a Hadoop cluster
      - Perimeter security & authentication
      - Single endpoint for users to connect to
      - Knox Gateway sits in the DMZ, provides the choke-point for cluster access
      - Hadoop sits behind, protected
      - Allows user access; allows admin access
  60. Mesos (Incubating)
      - Like virtualisation on top of your cluster
      - Resource sharing and isolation
      - Hadoop, MPI, Hypertable, Spark
      - Can run multiple copies of each, eg production, ad-hoc and testing versions of Hadoop
      - Can manage long-running (eg HBase) at the same time as batch
      - Zookeeper based
  61. Building out your Solution
  62. Tika – tika.apache.org
      - Text and metadata extraction
      - Identify file type, language, encoding
      - Extracts text as structured XHTML
      - Consistent metadata across formats
      - Java library, CLI and Network Server
      - SOLR integration
      - Handles format differences for you
  63. UIMA – uima.apache.org
      - Unstructured Information analysis
      - Lets you build a tool to extract information from unstructured data
      - Components in C++ and Java
      - Network enabled – can spread work out across a cluster
      - Helped IBM to win Jeopardy!
  64. OpenNLP – opennlp.apache.org
      - Natural Language Processing
      - Various tools for sentence detection, tokenization, tagging, chunking, entity detection etc
      - UIMA likely to be better if you want a whole solution
      - OpenNLP good when integrating NLP into your own solution
  65. cTakes – ctakes.apache.org
      - Clinical Text Analysis & Knowledge Extraction
      - Builds on top of UIMA and OpenNLP
      - Extracts information from free text in electronic medical records (and OCR'd paper)
      - Identifies named entities from common or custom dictionaries (eg UMLS)
      - For each, produces attributes, eg subject, mapping, context
  66. MINA – mina.apache.org
      - Framework for writing scalable, high performance network apps in Java
      - TCP and UDP, Client and Server
      - Build non-blocking, event driven networking code in Java
      - MINA also provides pure Java SSH, XMPP, Web and FTP servers
  67. Etch – etch.apache.org
      - Framework for building, producing and consuming network services
      - Cross platform, language and transport
      - Java, C#, C, C++, Go, JS, Python
      - Produce a formal description of the exchange
      - 1-way, 2-way and realtime communication
      - Scalable, high performance; heterogeneous systems
  68. ActiveMQ / Qpid / Synapse
      - Messaging, queueing and brokerage solutions across most languages
      - Decide on your chosen message format, endpoint languages and messaging needs; one of these three will likely fit your needs!
      - Queues, Message Brokers, Enterprise Service Buses, high performance and yet also buzzword compliant!
  69. Kafka (Incubating)
      - Distributed publish-subscribe (pub-sub) messaging system
      - Allows high throughput on one system, partitionable for many
      - More general than log based systems such as Flume
      - Supports persisted messages, clients can catch up
      - Distributed, remote sources and sinks
  70. S4 (Incubating)
      - Simple Scalable Streaming System
      - Platform for building tools to work on continuous, unbounded data streams
      - Stream is broken into events, which are routed to PEs and processed
      - Uses the Actors model, highly concurrent
      - PEs are Java based, but event sources and sinks can be in any language
  71. Commons – commons.apache.org
      - Collection of libraries for Java projects
      - Some historic, many still useful!
      - Commons CLI – parameters / options
      - Commons Codec – Base64, Hex, Phonetic
      - Commons Compress – zip, tar, gz, bz2
      - Commons Daemon – OS friendly start/stop
      - Commons Pool – Object pools (db, conn etc)
  72. JMeter – jmeter.apache.org
      - Load testing tool
      - Performance test network services
      - Define a series of tasks, execute in parallel
      - Talks to Web, SOAP, LDAP, JMS, JDBC etc
      - Handy for checking how external resources and systems will hold up, once a big data system starts to make heavy use of them!
  73. Chemistry – chemistry.apache.org
      - Java, Python, .Net, PHP, JS, Android, iOS interface to Content Management Systems
      - Implements the OASIS CMIS spec
      - Browse, read and write data in your content repositories
      - Rich information and structure
      - Supported by Alfresco, Microsoft, SAP, Adobe, EMC, OpenText and more
  74. ManifoldCF – manifoldcf.apache.org
      - Framework for content (mostly text) extraction from content repositories
      - Aimed at indexing solutions, eg SOLR
      - Connectors for reading and writing
      - Simpler than Chemistry, but also works for CIFS, file systems, RSS etc
      - Extract from SharePoint, FileNet, Documentum, LiveLink etc
  75. Questions?
  76. Thanks!
      - Twitter – @Gagravarr
      - Email – nick.burch@quanticate.com
      - The Apache Software Foundation: projects list: