More than a century ago, Sir Arthur Conan Doyle, creator of Sherlock Holmes, wrote about the importance of information and data, and of analyzing them scientifically. In the story “A Scandal in Bohemia” the author wrote:

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

It is impossible to separate the impact of the information generated on the internet, of its connectivity, and of mobile devices from the analysis of the data generated in these spaces.

Data Scraping

Right now, information is the most valuable resource available to mankind. Understanding data and the evidence it reveals is one of the most useful skills a person can have in the Information Era.

At NOSTRODATA we like to use the concept of “Data Sciences” in its strictest interpretation, and we know that in order to do science the first step is to observe a phenomenon that can be quantified. Big Data is the term used in Data Sciences to describe data sets that can be analyzed to find trends, patterns and their relationship to human behaviour.

It is possible to say that there are two ways in which data sets can be collected: data scraping and data mining. The difference between them is that data scraping can be done manually, while data mining is done by robots specially programmed to do so without human supervision.

As an example of data scraping, we made CSV files from the transcripts of the morning news conferences given by the president of Mexico, Andrés Manuel López Obrador. The objective is to make the transcripts, starting from the first conferences in December 2018, available to the data science community and the public.

Each transcript is separated by participant. As an example of the type of analysis that can be done with them, we made a couple of word maps for each speaker.
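To give an idea of the counting that sits behind a word map, here is a minimal Python sketch. It is not the code we used, and it makes some assumptions: the column names "participant" and "text" are hypothetical (the real CSV headers may differ), the sample rows are made up, and the stop-word list is only illustrative. The real transcripts are in Spanish, so a Spanish stop-word list would be needed.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Hypothetical column names; the real CSV headers in the repository may differ.
SPEAKER_COL = "participant"
TEXT_COL = "text"

# A tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"the", "and", "of", "to", "a", "is", "we", "in"}


def word_counts_by_speaker(csv_file):
    """Return {speaker: Counter of words} for an open CSV file object."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        # \w+ matches runs of word characters, including accented letters.
        words = re.findall(r"\w+", row[TEXT_COL].lower())
        counts[row[SPEAKER_COL]].update(w for w in words if w not in STOP_WORDS)
    return dict(counts)


# Made-up sample rows standing in for one transcript file.
sample = io.StringIO(
    "participant,text\n"
    "SPEAKER A,Health and security are the priority\n"
    "SPEAKER B,What is the plan for security\n"
    "SPEAKER A,Security is the priority\n"
)

counts = word_counts_by_speaker(sample)
for speaker, counter in counts.items():
    # Show the three most frequent words for each speaker.
    print(speaker, counter.most_common(3))
```

With a real file, the `io.StringIO` sample would be replaced by something like `open(path, newline="", encoding="utf-8")`. The per-speaker counts could then be handed to any word cloud tool to draw the map.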

The repository with CSV files can be found here:

https://github.com/NOSTRODATA/conferencias_matutinas_amlo

The dataset of the morning news conferences is a clear example of how data can be used to observe which topics are of importance during each conference. At NOSTRODATA we are interested in the community taking part by doing their own analysis, so we published a repository with the database so everyone can access it.

If you’re interested in collaborating with NOSTRODATA, contact us by clicking here.