Stock market prediction data set

I recently read an interesting research paper by Johan Bollen, Huina Mao, and Xiao-Jun Zeng from Indiana University entitled “Twitter mood predicts the stock market,” which investigated whether “collective mood states derived from large-scale Twitter feeds” correlated with the value of the Dow Jones Industrial Average. They found that their algorithm not only paralleled market changes but predicted them, with a startling 87.6 percent accuracy!

As a provider of Big Data analytics software, we see this type and scale of problem all the time at our customer sites, particularly the correlation of structured and unstructured data. For this particular study, let’s see how easy it is to reproduce this analysis with Datameer Analytics Solution (DAS).

First, let’s download the Dow Jones stock values data. You can get this freely, from Yahoo for example (DJIA). This is a simple CSV file format showing daily prices. You can also download other data, such as the NYSE Composite index, to experiment with.
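Outside of DAS, reading such a file takes only a few lines. The following is a minimal Python sketch, assuming the CSV has `Date` and `Close` columns with ISO-formatted dates; the column names and the sample values are assumptions for illustration, not real DJIA figures.

```python
import csv
import io
from datetime import date

# Hypothetical sample in a Date/Open/High/Low/Close layout; values are made up.
SAMPLE_CSV = """Date,Open,High,Low,Close
2010-03-01,100.0,102.0,99.0,101.0
2010-03-02,101.0,103.0,100.0,100.5
2010-03-03,100.5,104.0,100.0,103.0
"""

def load_daily_closes(csv_file):
    """Return a list of (date, close) tuples sorted by date."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append((date.fromisoformat(row["Date"]), float(row["Close"])))
    rows.sort()
    return rows

def daily_changes(closes):
    """Return (date, change) pairs: each close minus the previous close."""
    return [(d, c - prev_c) for (_, prev_c), (d, c) in zip(closes, closes[1:])]

closes = load_daily_closes(io.StringIO(SAMPLE_CSV))
changes = daily_changes(closes)
```

The day-over-day change is computed here because the paper's claim concerns the direction of market movement rather than the index level itself.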

Second, let’s get some Twitter data from Twitter’s API, known as the “fire hose.” For this test, we’ll use raw data (i.e., unfiltered tweets) for the entire month of March 2010.

Let’s load all of this data into DAS. In our new 1.3.x version, you can simply upload a file from your local computer, so let’s load our Dow Jones data this way.

Then let’s load the tweets via an Import Job, which understands Twitter’s format natively.

This amounts to about 30 GB of compressed data for the month.

Let’s first try to get a more accurate data set by filtering the tweets to US users. This is something the researchers apparently did not do (“we note that our analysis is not designed to be limited to any particular geographical location”), but it is easy to do with DAS.
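For readers without DAS, the same kind of filter can be sketched in Python. This is only an illustration: it assumes one JSON object per line, with a `user` object carrying a `time_zone` string, and it treats a handful of time zone names as a crude proxy for “US user.” The field names, the zone list, and the sample tweets are all assumptions.

```python
import json

# Assumed proxy for "US user": the profile time zone is one of these names.
US_TIME_ZONES = {
    "Eastern Time (US & Canada)",
    "Central Time (US & Canada)",
    "Mountain Time (US & Canada)",
    "Pacific Time (US & Canada)",
}

def is_us_tweet(tweet):
    """True if the tweet's (assumed) user.time_zone is in the US proxy set."""
    user = tweet.get("user") or {}
    return user.get("time_zone") in US_TIME_ZONES

def filter_us_tweets(lines):
    """Yield parsed tweets that pass the filter, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(tweet, dict) and is_us_tweet(tweet):
            yield tweet

# Made-up sample input, one JSON document per line.
sample_lines = [
    '{"text": "Great day!", "user": {"time_zone": "Pacific Time (US & Canada)"}}',
    '{"text": "Guten Morgen", "user": {"time_zone": "Berlin"}}',
    '{"text": "no profile zone", "user": {"time_zone": null}}',
    'not valid json',
]
us_tweets = list(filter_us_tweets(sample_lines))
```

A time zone is a rough signal at best (these zone names also cover Canada), so a real filter would need whatever location data the feed actually provides.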

We had neither OpinionFinder nor Google-Profile of Mood States at our disposal to perform sentiment analysis (these could make great new functions that could be added via our API some day!), so let’s instead use a simplified approach: take a list of positive terms (a bag-of-words model) and find the tweets that contain these terms.

To do this in DAS, let’s import a list of such terms (these can easily be found on various web sites), create an outer join with our tweets, and then filter to find the tweets that contain these positive words.
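The idea behind this join-and-filter step can be sketched in plain Python. This is not how DAS implements it: the term list, the `(day, text)` tuple layout, and the sample tweets are assumptions for illustration, and the matching is a simple set lookup rather than a join.

```python
import re
from collections import Counter

# Illustrative positive terms; a real list would be much longer.
POSITIVE_TERMS = {"good", "great", "happy", "love"}

TOKEN_RE = re.compile(r"[a-z']+")

def is_positive(text, terms=POSITIVE_TERMS):
    """True if the tweet shares at least one token with the positive term list."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    return not tokens.isdisjoint(terms)

def positive_share_by_day(tweets):
    """Map each day to the fraction of its tweets that contain a positive term."""
    totals = Counter()
    positives = Counter()
    for day, text in tweets:
        totals[day] += 1
        if is_positive(text):
            positives[day] += 1
    return {day: positives[day] / totals[day] for day in totals}

# Made-up sample tweets as (day, text) pairs.
sample = [
    ("2010-03-01", "What a great morning"),
    ("2010-03-01", "Stuck in traffic again"),
    ("2010-03-02", "I love this song"),
    ("2010-03-02", "Happy Tuesday everyone!"),
]
shares = positive_share_by_day(sample)
```

Normalizing by the day's total tweet count keeps busy days from looking more positive simply because more was posted.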
