Data analysis using the CouchDB database
Cécile Kéfélian
KSETA Freudenstadt workshop, 17/10/2013

Goal of the workshop

Getting an overview of the CouchDB database and its usefulness for monitoring and data analysis:
- What is CouchDB? What are its benefits?
- How to get information from it using views?
- How to handle the Big Data problem?
- How to use CouchDB from a Python script using couchdbkit?
- What interesting features are offered by CouchDB? CouchApps, replication

Introduction to CouchDB

CouchDB: Cluster Of Unreliable Commodity Hardware DataBase

Definition from the official website: CouchDB is an open source document-oriented database. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript.

Keywords: open source, NoSQL, schema-free, document-based, CouchApps, key/value pairs, N-master replication, views, revisions, map/reduce, JSON-based, RESTful API

Infinite applications: films, SMS, contacts, cooking recipes, web apps, blogs, websites... and also: monitor the detector temperature, store analysis results.

Installation of CouchDB

- Pre-compiled binaries available for all platforms
- Available on all operating systems, also on exotic ones ;)

How to administrate the CouchDB database

From creation to replication to data insertion, CouchDB administration can be done via HTTP.
- CouchDB exposes a RESTful API: the 4 HTTP methods GET, POST, PUT and DELETE can be used
- Terminal + command-line utility to send HTTP requests (like curl)
- Futon (built-in web administration interface): load Futon in your browser. Not working with Internet Explorer; Firefox or Chrome advised

Admin rights

By default, CouchDB gives every user admin rights on all databases.

MySQL vs NoSQL database

SQL (Structured Query Language): programming language designed for managing data held in a relational database.
MySQL:
- Supports SQL
- Relational database (collection of tables of data items, described and organized according to the relational model)
- Collection of tables of data items to be defined up-front
- Relationships between tuples have to be defined
- Specific protocol used to communicate with the db
- Up-front defined structure

NoSQL (CouchDB, MongoDB):
- Does not support SQL
- Document-based database
- Collection of self-contained documents which can differ from each other (documents not stored in a defined table)
- No relationships have to be defined
- HTTP protocol used to communicate with the db (in the case of CouchDB)
- Schema-free

Let's create a database containing the list of the films you watched :)

Creating a database

Creating a document

- Database empty at the moment...
- Functions available
- Security: define admins and members
- After clicking on New Document: an _id field and its value are created automatically (unique value identifying the document)

Document structure

Key/value pair structure:
- Field (= key): string
- Value: JSON (JavaScript Object Notation) object:
  - number (either integer or float)
  - string
  - boolean (true/false value)
  - array
  - object (a set of key/value pairs)

JSON is used to format the content and structure of the data and responses. CouchDB also supports attachments.

Document revision

After saving, a new field _rev is automatically added.

Each time the document is modified (key/value pair added or modified) and saved, a new _rev value is given to the document.
- _rev changes after saving: adding a new field, correction of a value
- Previous versions remain accessible!
- All the previous document revisions can be deleted by clicking on Compact & Cleanup

JSON-based document storage

After clicking on Source: Futon interprets the key/value pairs as JSON objects.
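As a rough illustration of the document structure described above, the following Python sketch builds a film document as a dictionary and serializes it to JSON. The field names (title, year, rating, seen, actors, crew) and the _id value are made up for this example; in Futon, _id and _rev are generated by CouchDB itself.

```python
import json

# Hypothetical film document; all field names and values are illustrative only.
film = {
    "_id": "film-001",                    # normally generated by CouchDB
    "title": "Example film",              # string
    "year": 2013,                         # number (integer)
    "rating": 7.5,                        # number (float)
    "seen": True,                         # boolean
    "actors": ["Actor A", "Actor B"],     # array
    "crew": {"director": "Director X"},   # object (set of key/value pairs)
}

# Serialize to JSON text, then parse it back into a Python dictionary.
as_json = json.dumps(film, indent=2, sort_keys=True)
round_trip = json.loads(as_json)
print(as_json)
```

The round trip shows that a Python dictionary and a JSON document carry the same key/value structure, which is why the Python examples later in the talk can treat dictionaries as documents.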
By clicking on Source, the underlying JSON document is displayed.

Schema-free database

We can add a document with different key/value pairs in the films database: documents of a given db do not necessarily have the same structure. Attachments are possible in a document :)

Useful JSON syntax

Value: JSON (JavaScript Object Notation) object: number (either integer or float), string, boolean (true/false value), array, set of key/value pairs. Which syntax should be used for the interpreter to recognize the object type?

JSON arrays:

    "actors": ["Penelope Cruz", "George Clooney"],

JSON set of key/value pairs:

    {"chocolate": 150, "flour": 80, "sugar": 100, "butter": 80, "eggs": 4, "coconut": 80, "baking powder": 1}

Now, let's... get information from the database

Getting useful information using views

CouchDB is schema-free, i.e. unstructured by nature, which makes it difficult to use in real-world applications. Use views to give structure to the data.

Two kinds of views exist:
- permanent views (stored in a design document): iterate over every document and build a list of documents with specific fields; improve the performance
- temporary views: executed on command, but resource-intensive, and become slower as the amount of data stored in the db increases

Views are based on the Map/Reduce principle and use JavaScript functions.
Views are not updated when a document is saved but when the view is run: the first run can take long if there are many documents in the db.

In Futon, go to Temporary view to write a view. Views are based on the Map/Reduce principle:
- map: extracting data. Screenshot: simplest map function; after clicking on Run, the view output is displayed
- reduce: data aggregation. A set of keys and values is passed to it and combined into a single value. Predefined reduce functions: _sum, _count and _stats

Map function syntax

Considering a document of the following form:

    {
      "_id": "f3afc a09b6dabeeb3cb000f1e",
      "_rev": "7-0f60974d9a81d92caa8c4ee13285c104",
      "string": "HelloWorld",
      "int": 5,
      "output1": "Couch",
      "output2": "DB"
    }

Example of a map function:

    function(doc) {
      if (doc.string && doc.int == 5)
        emit(doc.output1, doc.output2);
    }

What you should know about map syntax:
- indentation is not compulsory
- doc["key1"] is equivalent to doc.key1
- between 2 conditions: && (logical AND)
- if(doc.key3) means: if the document has a field called key3 (with a value that is not empty, 0 or false), then continue
- emit() generates the output

Examples of map functions

Condition on a simple key/value pair (to view recipes with cooking + preparation time below 20 min):

    function(doc) {
      if ((doc.preparationtime + doc.cookingtime) < 20)
        emit(doc._id, doc.ingredients);
    }

Condition on object elements (to view recipes using tomatoes):

    function(doc) {
      if (doc.ingredients && doc.ingredients["tomato"] > 0)
        emit(doc._id, doc.ingredients);
    }

Example of an ingredients value:

    {"pastry brisée": 1, "oignon": 7, "egg": 5, "lardon": 250, "butter": 25, "liquid creme": 20}

Condition on a table of values. In the document, the underlying JSON is:

    "actors": ["Penelope Cruz", "George Clooney"],

We need the following map function to query an actor:

    function(doc) {
      if (doc.actors) {
        for (var i = 0; i < doc["actors"].length; i++) {
          if (doc["actors"][i] == "Penelope Cruz")
            emit(doc.title, doc.actors);
        }
      }
    }

Using CouchDB for physics purposes

Using CouchDB in physics

We can store:
- DAQ information: run configuration...
- Slow control (temperature, pressure...)
- Hardware maps
- Information on detectors
- Energy resolution
- Noise spectra/filters
- Analysis results

Physics applications often require a fast-growing db.
Problem: CouchDB does not offer horizontal scaling, i.e. across many servers.
Solution: going from your small Couch... to the BigCouch
- BigCouch was released and is primarily maintained by Cloudant
- Based on CouchDB
- BigCouch allows creating clusters of CouchDBs that are distributed over many servers but appear to the user as one CouchDB instance
- All the CouchDB servers act together to store and retrieve documents, index and serve views, and serve CouchApps

CouchDB available frameworks

- All languages which can deal with HTTP can be used to administrate the db
- Libraries exist for many languages
- C++ tools are less developed and less convenient than Python tools
- The following examples use the Python tool couchdbkit (a framework for Python to access and manage CouchDB)

Example: position monitoring using CouchDB

Position of the muon veto chariots, measured every 15 min and written in a text file.
- File content: date + time of the measurement, 5 measurements of the distance
- [Figure: EDELWEISS setup with the wall, distance est (Dest) and distance nemo (Dnemo)]

Store a text file content in the db

ADVICE: don't store documents individually, but create a list of documents and store them in one call.

    import datetime
    import math
    import time

    # import the couchdbkit library, which allows the communication with the db
    import couchdbkit
    import numpy as np

    # create an empty list which will be used to store documents
    docs = []

    # open the text file containing the useful information
    f = open('path/workfile.txt', 'r')
    for line in f:  # go over the file lines
        # get the date
        line_list = line.split(' ')
        date_str = line_list[0].strip(' ')
        # get the 5 measurement values in a list
        val_list = [float(x) for x in line_list[3].split(',')]
        # put these values in an array using the numpy package
        val_np = np.array(val_list)
        # convert the read date into a time object;
        # be careful of time conversion from your time zone to UTC!!!
        date = datetime.datetime.strptime(date_str, '%y-%m-%d %H:%M:%S')
        # convert the date to unix time (ADVICE: always store the time in unix time)
        unix_time = time.mktime(date.timetuple())

        # create an empty dictionary to store the document
        adoc = {}
        adoc['avevalue'] = val_np.mean()
        adoc['uncervalue'] = math.sqrt(val_np.var())
        adoc['unixtime'] = unix_time
        adoc['position'] = 'est'
        # append the document to the docs list
        docs.append(adoc)

Python dictionary == CouchDB document! Dictionaries consist of pairs of keys and their corresponding values (like in a db document!):

    adict = {'Name': 'Antoine', 'Age': 23, 'Institut': 'LASIM'}

Once all the file lines have been read and the information put in dictionaries, call the database and send it the list of documents:

    # connect to the cloudant server
    s = couchdbkit.Server('...')  # URL of your CouchDB/Cloudant server
    # create a database with the name dbname from a python script
    db = s.create_db('dbname')
    # call an existing db
    db = s['vetopos']
    # or create a db if non existing, get the existing one otherwise
    db = s.get_or_create_db('dbname')
    # save the list of documents appended to docs
    db.bulk_save(docs)
    # to save a single document
    # db.save_doc(adoc)

Data selection by using the database

Communication with our Cloudant database. Requirements for a correct analysis, for each event of the ROOT tree:
- closed muon veto
- HV ON for the whole system

[Figure: rate per 1 min bin versus time, gap selection with a cut at 4.6 cm (data skipped on one side of the cut): from 68.9 days to 48 days]
[Figure: rate per 1 min bin versus time, HV selection (some HV off: data skipped; all HV on: kept; annotations: M15 / M16 off, change of HV channel): from 48 days to 45.2 days]

Using views in a python script

Problem: accessing the database for each event is time consuming.
Solution: copy the database documents useful for the analysis into a local dictionary.

Before performing the analysis:

    # create empty lists
    DocListEst = []   # to store the position of the est part
    DocListNemo = []  # to store the position of the west part

    # connect to the cloudant server
    s = couchdbkit.Server('...')  # URL of your CouchDB/Cloudant server
    # get the db called vetopos
    db = s['vetopos']

    # select the view to get the position of the est chariot
    # starttime/endtime: beginning and end of the analysis
    # reduce=False: disable an eventual reduce function
    vr = db.view('app/est_bydate', startkey=starttime, endkey=endtime, reduce=False)
    # loop over the rows of the view (key = unixtime, value = avevalue)
    for row in vr:
        # store the useful fields in the dictionary
        DocListEst.append({'PcTime': row['key'], 'Position': row['value']})

    vr2 = db.view('app/nemo_bydate', startkey=starttime, endkey=endtime, reduce=False)
    # loop over the rows of the view
    for row in vr2:
        # store the useful fields in the dictionary
        DocListNemo.append({'PcTime': row['key'], 'Position': row['value']})

Reducing time consumption while reading documents

If there are only few changes of the useful value during the studied time period (for example of HV): create a reduced list saying when the value changed and the new value.

    # ensure the Docs list is sorted by time
    Docs.sort(key=lambda x: x['time'])
    # create a reduced list containing the time and new value
    ReducedList = []
    ReducedList.append({'time': Docs[0]['time'], 'value': Docs[0]['value']})
    valueref = Docs[0].get('value')
    # loop over the documents in Docs
    for item in Docs:
        if item['value'] == valueref:
            print("no change, don't append the document!")
        else:
            print('the value has changed. Append the document!')
            ReducedList.append({'time': item['time'], 'value': item['value']})
            valueref = item['value']

If there are many changes of the useful value: save the list index of the document matching the current event, so that the search for the next event starts from there.

    save_index = 0
    for entry in range(0, t.GetEntries()):
        t.GetEntry(entry)
        for index in range(save_index, len(docs) - 1):
            if (event.getpctimesec() >= docs[index]['time']
                    and event.getpctimesec() < docs[index + 1]['time']):
                save_index = index
                break
        # do stuff

More examples: delete documents from the db

    # connect to the server
    s = couchdbkit.Server('...')  # URL of your CouchDB/Cloudant server
    # select the corresponding db
    db = s['vetopos']
    # select a view
    vr = db.view('app/nemo_bydate', reduce=False)
    # map function of the view:
    # function(doc) {
    #   if (doc.unixtime && doc.position == "est" && typeof(doc.avevalue) === 'number')
    #     emit(doc.unixtime, doc.avevalue);
    # }
    for row in vr:
        # tmin, tmax: placeholders for the bounds of the time window to delete
        if row['key'] > tmin and row['key'] < tmax:
            db.delete_doc(row['id'])

Database utilities

Database replication

Replication: synchronization of 2 copies of the same database, allowing easy access to data.
- The databases can live on the same or different servers.
- If one copy of the database is changed, replication will send these changes to the other copy.
- To do a replication, the user sends an HTTP request to CouchDB that includes a source and a target database, and CouchDB will send the changes from the source to the target.
- Simple replication is available from the Futon interface.

CouchApps

CouchApp: standalone web application (based on HTML and JavaScript) that can be entirely self-contained in a design document within the database that provides the data.

CouchApp example

Documentation

- Official wiki page
- Official Apache CouchDB website
- Online book dedicated to CouchDB
- Short video tutorial
- couchdbkit toolkit
- About views
- About predefined reduce functions
- CouchApps toolkit
- Nice tutorial on CouchApps

Backup slides

Getting information using HTTP

Mind the ? between URL and parameters.

HTTP parameters

Predefined reduce functions

- _sum just adds up the emitted values, which must be numbers.
- _count counts the number of emitted values. (It's like _sum for emit(foo, 1).) It ignores the contents of the values, so they can be any type.
- _stats calculates some numerical statistics on your emitted values, which must be numbers. The reduce output is an object that looks like this:

    {"sum": 2, "count": 2, "min": 1, "max": 1, "sumsqr": 2}

  sum and count are equivalent to the _sum and _count reductions. min and max are the minimum and maximum emitted values. sumsqr is the sum of the squares of the emitted values (useful for statistical calculations like standard deviation).
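To make the meaning of the _stats fields concrete, here is a minimal Python sketch that reproduces the same quantities on a plain list of numbers and derives a standard deviation from sum, count and sumsqr. The function names reduce_stats and std_from_stats are hypothetical helpers written for this illustration; they are not part of CouchDB or couchdbkit, and the real reduce runs on the server.

```python
import math


def reduce_stats(values):
    """Return a dictionary with the same fields as the _stats reduce output."""
    return {
        "sum": sum(values),
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sumsqr": sum(v * v for v in values),
    }


def std_from_stats(stats):
    """Population standard deviation computed from sum, count and sumsqr."""
    mean = stats["sum"] / float(stats["count"])
    variance = stats["sumsqr"] / float(stats["count"]) - mean * mean
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Two emitted values equal to 1, as in the example output above.
example = reduce_stats([1, 1])
print(example)

# Standard deviation of a small sample, obtained from the aggregated fields only.
sample_std = std_from_stats(reduce_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(sample_std)
```

The point of sumsqr is that the spread of the values can be recovered from three aggregated numbers, without reading the individual documents again.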